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SUMMARY 

The  overall  objective  of  the  research  carried  out  over  the  last  three  years  was  the 
development  of  new  algorithms  for  the  efficient  simulation  of  viscous  compressible  flows 
with  moving  bodies  in  three  dimensions  using  unstructured  grids.  The  development  was 
based  on  current  3-D  Euler/Navier-Stokes  capabilities,  and  encompassed  flow  solvers, 
grid  generation,  fluid-structure  interaction,  the  efficient  use  of  supercomputer  hardware, 
and  new  visualization  capabilities.  The  research  carried  out  over  the  last  three  years 
significantly  advanced  the  state  of  the  art  in  this  area  of  CFD.  The  particular  topics 
are  treated  below  in  detail. 
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1.  FLOW  SOLVERS 

For  the  flow  solvers,  seven  major  developments  took  place  over  the  course  of  this  re¬ 
search  effort: 

a)  Implicit  flow  solvers; 

b)  Better  mesh  moving  strategies; 

c)  Implementation  of  turbulence  models;  and 

d)  Validation  studies. 

1.1  Implicit  Flow  Solvers 

Implicit  flow  solvers  are  considered  essential  for  the  efficient  simulation  of  viscous,  com¬ 
pressible,  time-dependent  flows.  We  developed  a  linearized  implicit  scheme  that  uses  a 
Generalized  Minimal  RESiduals  algorithm  in  conjunction  with  incomplete  lower-upper 
(ILU)  preconditioning  for  the  solution  of  the  Euler  and  Navier-Stokes  equations.  The 
results  were  encouraging,  showing  that  for  Euler  problems  steady  state  results  could 
be  achieved  in  less  than  40  steps.  On  the  other  hand,  the  storage  costs  and  the  cost  of 
getting  close  to  the  solution  at  the  start  of  the  iteration  were  considered  suboptimal. 
This  led  to  development  of  matrix-free  preconditioners,  in  particular  GMRES-LU-SGS. 
The  Euler/Navier-Stokes  equations  may  be  written  as  a  system  of  conservation  laws  of 
the  form: 


or,  in  integral  form,  as: 


U(  +  V  •  F  =  0  , 


^  J  u d£l  +  j  F  •  ndF  =  0 

The  set  of  equations  resulting  from  spatial  discretization  using  an  edge-based  data 
structure  may  be  written  as: 

Miu;<  =  r'=£ci)  'F« .  (i) 

where  M,C,  u  and  F  denote,  respectively,  the  lumped  mass-matrix  (volume),  geom¬ 
etry  (shape-function  derivative)  factor,  unknowns  and  fluxes.  Work  continued  on  the 
GMRES-LU-SGS  scheme.  This  scheme  is  based  on  the  vectorized  LU-SGS  scheme 
first  proposed  in  [2].  The  implicit  time  integration  of  Eqn.(l)  using  a  backward  Euler 
scheme  with  linearization  yields: 


Aul 


Denoting  by  pA  the  spectral  radius  of  the  Jacobian  A  and  defining: 


D  = 


I  ,  AF  =  F(u  +  Au)  -  F(u)  , 


(2) 


(3) 
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this  system  of  equations  is  solved  by  the  following  relaxation  procedure: 

a)  Forward  Sweep: 

Au*  =  D"1  |r*  -  0.5  Y  Cij  •  (AF0-  -  pAAuj) 

b)  Backward  Sweep: 


(4a) 


Au*  =  Au*  -  0.5D-1  Y  Cij  ■  ( AFy  -  pA Au  Ctj  •  (AF0-  -  pA Au,-)  (46) 

^ ) 

We  have  shown  in  [2,5,6]  that  this  scheme  is  an  excellent  preconditioner  for  the  Gener¬ 
alized  Minimal  RESiduals  (GMRES)  iterative  solver  that  is  by  now  firmly  established 
in  CFD.  This  scheme  was  extended  to  transient  problems  (time-accurate  solvers  [6]) 
and  was  investigated  further  in  two  directions,  namely:  robustness  and  parallelism. 
The  investigations  with  respect  to  the  robustness  of  the  GMRES-LU-SGS  showed  that 
the  scheme  was  remarkably  insensitive  to  point  and  edge  numbering.  Parallel  schemes 
were  investigated  for  shared-memory  machines.  It  was  found  that  the  GMRES-LU- 
SGS  parallelizes  well  up  to  24  processors,  the  maximum  number  avaible  to  us  at  the 
present  time.  For  technical  details  the  reader  is  referred  to  [2],  which  is  reproduced  in 
Appendix  1. 

1.2  Better  Mesh  Moving  Strategies 

A  Laplacian  smoothing  of  the  mesh  velocities  with  variable  diffusivity  based  on  the 
distance  from  moving  bodies  was  developed  [11].  This  variable  diffusivity  enforces  a 
more  uniform  mesh  velocity  in  the  region  close  to  the  moving  bodies.  Given  that  in 
most  applications  these  are  regions  where  small  elements  are  located,  the  new  proce¬ 
dure  decreases  considerably  element  distortion,  reducing  the  need  for  local  or  global 
remeshing,  and  in  some  cases  avoiding  it  alltogether.  A  hypersonic  store  release  was 
used  to  test  the  new  algorithm.  Numerical  results  obtained  show  that  the  new  mesh 
velocity  smoothing  leads  to  a  much  less  deformed  grid  close  to  the  moving  missile.  For 
this  case,  the  number  of  local  remeshings  required  dropped  by  a  factor  of  1:4,  leading 
to  considerable  CPU  savings  in  multiprocessor  environment.  Since  then,  this  algorithm 
has  been  used  extensively  for  many  applications  [9-12].  For  technical  details  the  reader 
is  referred  to  [11]. 

1.3  Implementation  of  Turbulence  Models 

We  continued  with  the  implementation  of  several  turbulence  models.  Some  of  these 
have  very  stiff  source  or  production  terms  that  require  careful  numerical  treatment. 
For  technical  details  the  reader  is  referred  to  [1,8]. 
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1.4  Validation  Studies 


The  efficiency  and  fidelity  of  the  new  Arbitrary  Lagrangian-Eulerian  (ALE)  methodol¬ 
ogy  on  unstructured  grids  was  validated  by  two  simulations.  This  validation  effort  was 
part  of  an  ongoing  research  effort  to  develop  a  cost-efficient  and  accurate  numerical 
methodology  capable  of  simulating  the  motion  of  complex-geometry,  three-dimensional 
bodies  embedded  in  external,  temporally  and  spatially  evolving  flow-fields. 

The  first  computation  modeled  the  separation  of  a  fuel  tank  from  an  F-16  C/D  fighter. 
The  numerical  Eulerian  predictions  were  compared  to  the  available  experimental  data. 
First,  a  series  of  steady-  state  runs  were  performed  to  compare  overall  loads  and  mo¬ 
ments.  Then,  the  fuel  tank  was  released,  with  proper  modeling  of  the  initial  impulse 
due  to  explosive  release.  Very  good  agreement  was  obtained  between  the  predicted  and 
measured  fuel  tank  trajectory  [10]. 

The  second  simulation  modeled  canopy  trajectory  for  an  F-18  fighter.  As  before,  very 
good  agreement  was  obtained  between  the  predicted  and  measured  fuel  tank  trajectory 
[12]- 


2.  GRID  GENERATION 

In  the  area  of  grid  generation,  there  were  three  major  developments  that  took  place 
during  the  course  of  this  research  effort: 

a)  Surface  meshing  from  discrete  data; 

b)  Parallel  grid  generation;  and 

c)  Navier-Stokes  gridding. 

2.1  Surface  Meshing  from  Discrete  Data 

An  advancing  front  surface  gridding  technique  that  operates  on  discretely  defined  faces 
was  developed  [29].  This  technique  is  based  on  three  steps:  surface  feature  recovery, 
actual  gridding,  and  surface  recovery.  The  following  aspects  have  to  be  considered 
carefully  in  order  to  make  the  precedure  reliable  for  complex  geometries: 

a)  Recovery  of  surface  features  and  discrete  surface  patches  from  the  discrete  data, 

b)  Filtering  based  on  point  and  side  normals  to  remove  undesirable  data  close  to 
cusps  and  corners, 

c)  Proper  choice  of  host  faces  for  ridges,  and 

d)  Fast  interpolation  procedures  suitable  for  complex  geometry. 

Several  examples  ranging  from  academic  to  industrial  demonstrated  the  utility  of  the 
developed  procedure  for  ab  initio  surface  meshing  from  discrete  data,  such  as  encoun¬ 
tered  when  the  surface  description  is  already  given  as  discrete,  the  improvement  of 
existing  surface  triangulations,  as  well  as  remeshing  applications  during  runs  exhibit¬ 
ing  significant  change  of  domain.  For  technical  details  the  reader  is  referred  to  [29]. 
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2.2  Parallel  Grid  Generation 


FY99  saw  the  development,  coding  and  debugging  of  a  new  parallel  advancing  front 
grid  generation  algorithm.  We  consider  this  to  be  truly  a  breakthrough,  as  it  allows 
the  unstructured  grid  ALE  moving  body  methodology  to  be  extended  fully  to  parallel 
machines.  To  date,  the  required  remeshing  that  always  appears  for  moving  mesh  sim¬ 
ulations  with  complex  body  motion  had  to  be  carried  out  in  scalar  mode.  If  we  define 
as  dmin  the  minimum  element  size  in  the  active  front,  and  as  smjn  the  minimum  box 
size  in  which  elements  are  to  be  generated,  the  parallel  advancing  front  grid  generator 
proceeds  as  follows: 

WHILE:  There  are  active  faces  left: 

-  Form  an  octree  with  minimum  octant  size  smin  for  the  active  points; 

-  Retain  the  octants  that  have  faces  that  will  generate  elements  of  size  dmtn  to 

Q  '  dmin ; 

-  If  too  many  octants  are  left:  agglomerate  them  into  boxes; 

-  DO  ISHFT=0,2: 

-  IF:  ISHFT.NE.O: 

Shift  the  boxes  by  a  preset  amount; 

-  END IF 

-  Generate,  in  parallel,  elements  in  these  boxes,  allowing  only  elements  up  to  a 
size  of  c\  *  dffiifi ; 

-  ENDDO 

Increase  dmin  =  1.5  *  dmini  sm{n  =  1.5  *  3min] 

ENDWHILE 

The  increase  factor  allowed  is  typically  in  the  range  c/  =  1.5  -  2.0.  Major  areas  of  work 
included: 
noi 

-  Treatment  of  cases  with  large  variation  of  element  size; 

-  Proper  estimation  of  work  per  domain/processor; 

-  Load  balancing;  and 

-  Reduction  of  inter-box  faces. 

The  parallel  mesh  generation  technique  was  debugged  and  improved  during  FY99,  and 
has  now  been  incorporated  into  the  production  code  FEFL098.  The  speedups  obtained 
for  grid  generation  are  now  comparable  to  those  of  the  CFD  solver.  For  more  details 
on  the  parallel  grid  generator,  see  [23],  which  is  reproduced  in  Appendix  2. 
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2.3  Navier-Stokes  Gridding 

Creating  highly  stretched  grids  of  acceptable  quality  for  complex  configurations  has 
been  an  outstanding  goal  for  over  two  decades.  During  FY98  and  FY99  we  continued 
the  development  of  a  new  RANS  gridding  algorithm  that  was  conceived  in  FY97.  The 
key  steps  of  this  algorithm  may  be  summarized  as  follows: 

a)  Generate  first  an  isotropic  mesh. 

b)  Using  a  constrained  Delaunay  technique,  introduce  points  in  order  to  generate 
highly  stretched  elements. 

c)  Introduce  the  points  in  ascending  level  of  stretching,  i.e.  from  the  domain  interior 
to  the  boundary  (or  interior  boundaries). 

This  procedure  has  the  following  advantages: 

-  No  surface  recovery  is  required  for  the  Delaunay  reconnection,  eliminating  the 
most  problematic  part  of  this  technique; 

-  Proper  meshing  at  concave  ridges/corners  is  obtained; 

-  The  meshing  of  concave  ridges/corners  requires  no  extra  work; 

-  Meshing  problems  due  to  surface  curvature  are  minimized; 

-  In  principle,  no  CAD  representation  of  the  surface  is  required;  and 

-  A  final  mesh  is  guaranteed,  an  essential  requirement  for  industry. 

The  disadvantages  are  the  following: 

-  As  with  any  Delaunay  technique,  the  mesh  quality  is  extremely  sensitive  to  point 
placement. 

Major  areas  of  work  included: 

-  Point  removal  techniques  in  the  pre-RANS-gridding  phase; 

-  Proper  point  placement  to  mitigate  the  possible  negative  effects  of  Delaunay  re¬ 
connection; 

-  Acceptance/rejection  tests  for  new  points,  particularly  for  gaps; 

-  Reconnection  of  surface  faces  for  optimal  meshes;  and 

-  Smooth  transition  of  element  size  and  stretching. 

While  this  is  work  in  progress,  we  feel  confident  that  a  new  level  of  gridding  technology 
has  been  achieved.  For  more  details  the  reader  is  referred  to  [21,22],  which  is  reproduced 
in  Appendix  3. 
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3.  FLUID-STRUCTURE-THERMAL  COUPLING 


In  the  area  of  interdisciplinary  couling,  there  were  three  major  developments  that  took 
place  during  the  course  of  this  research  effort: 

a)  Loose  coupling  strategies  for  fluid-structure-thermal  problems; 

b)  Link  to  FEEIGEN,  NASTRAN  and  DYNA3D;  and 

c)  Breakup  and  topology  change. 

3.1  Loose  Coupling  Strategies 

In  order  to  solve,  in  a  cost-effective  manner,  fluid-structure-thermal  interaction  prob¬ 
lems,  a  loosely  coupled  algorithm  to  combine  computational  fluid  dynamics  (CFD), 
computational  structural  dynamics  (CSD)  and  computational  thermo-dynamic  (CTD) 
codes  was  devised.  In  this  algorithm,  the  structure  is  used  as  the  ’master-surface’ 
to  define  the  extent  of  the  fluid  region,  and  the  fluid  is  used  as  the  ’master-surface’ 
to  define  the  loads.  The  transfer  of  loads,  heat  fluxes,  displacements,  velocities  and 
temperatures  is  carried  out  via  fast  interpolation  and  projection  algorithms.  This 
fluid-structure-thermal  algorithm  can  be  interpreted  as  an  iterative  solution  to  the 
fully  coupled,  large  matrix  problem  that  results  from  the  discretization  of  the  complete 
problem.  The  advantage  of  this  new  algorithm  is  that  it  allows  a  cost  effective  re-use 
of  existing  software,  with  minimum  amount  of  alterations  required  to  account  for  the 
interaction  of  the  different  media. 

Several  example  runs  using  FEFL098  as  the  CFD  code,  and  NASTRAN  as  the 
CSD/CTD  code,  demonstrate  the  effectiveness  of  the  proposed  methodology.  For  more 
details,  see  the  AIAA  invited  paper,  Ref.  [15],  which  is  reproduced  in  Appendix  4,  as 
well  as  [13-19]. 

3.2  Link  to  FEEIGEN.  NASTRAN  and  DYNA3D 

The  CFD  code  was  linked  to  three  different  structural  solvers  that  span  a  large  range 
of  applications: 

a)  FEEIGEN  is  an  eigenmode  integrator  with  adaptive  timestep.  FEEIGEN  can  read 
eigenmode  information  from  a  variety  of  sources,  including  such  well-known  codes  as 
NASTRAN  and  ANSYS.  It  then  integrates  the  eigenmodes  in  time,  thus  producing  the 
structural  response. 

b)  NASTRAN  is  a  general-purpose  linear  Finite  Element  code  used  in  the  present  con¬ 
text  for  elasticity  and  thermal  problems.  The  particular  version  used  is  the  COSMIC- 
NASTRAN  code,  as  it  offered  the  possibility  of  accesing  the  source  code. 

c)  DYNA3D  is  a  general-purpose  nonlinear  Finite  Element  code  used  in  the  present 
context  for  impact  and  shock-structure  interaction  simulation.  The  particular  sed  is 
the  LLNL  public- domain  DYNA3D,  as  it  offered  the  possibility  of  accesing  the  source 
code. 
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3.3  Breakup  and  Topology  Change 


The  loosely  coupled  algorithm  was  extended  to  include  breakup  and  topology  change. 
Suppose  that  due  to  cracking,  failure,  spalation,  etc.,  the  ‘wetted  surface’  of  the  CSD 
domain  has  been  changed.  This  new  surface,  given  by  a  list  of  points  and  faces,  has  to 
be  matched  with  a  corresponding  CFD  surface.  The  CFD  surface  data  typically  consists 
of  surface  segments  defined  by  analytical  functions  that  do  not  change  in  time  (such  as 
exterior  walls,  farfield  boundaries,  etc.),  and  surface  segments  defined  by  triangulations 
(i.e.  discrete  data)  that  change  in  time.  These  triangulations  are  obtained  from  the 
‘wetted  CSD  surface’  at  every  timestep.  When  a  change  in  topology  is  detected,  the 
new  surface  definition  is  recovered  from  the  discrete  data,  and  joined  to  the  surfaces 
defined  analytically.  The  discrete  surface  is  defined  by  a  support  triangulation,  with 
lines  and  end-points  to  delimit  its  boundaries.  In  this  sense,  the  only  difference  with 
analytically  defined  surfaces  is  the  (discrete)  support  triangulation.  The  patches,  lines 
and  end-points  of  the  wetted  CSD  surface’  are  identified  by  comparing  the  unit  surface 
normals  of  adjacent  faces.  If  the  scalar  product  of  them  lies  below  a  certain  tolerance, 
a  ridge  is  defined.  Corners  are  defined  as  points  that  are  attached  to: 

-  Only  one  ridge; 

-  More  than  two  ridges;  or 

-  Two  ridges  with  considerable  deviation  of  unit  side- vector. 

Between  corners,  the  ridges  form  discrete  lines.  These  discrete  lines  either  separate 
or  are  embedded  completely  (i.e.  used  twice)  in  discrete  surface  patches  [19].  For 
the  old  surface  definition  data  set,  the  surface  patches  attached  to  wetted  CSD  surfaces 
are  identified  and  all  information  associated  with  them  is  discarded.  The  remaining 
data  is  then  joined  to  the  new  wetted  CSD  surface  data,  producing  the  updated  surface 
definition  data  set.  This  data  set  is  then  used  to  generate  the  new  surface  and  volume 
grids. 

The  surface  reconstruction  procedure  may  be  summarized  as  follows: 

-  For  the  Updated  Discrete  Data,  Obtain: 

-  Surface  Patches  +  B.C. 

-  Lines 

-  End-Points 

-  Sources 

-  For  the  Old  Analytical+Discrete  Data: 

-  Remove  Discrete  Data 

-  Reorder  Arrays 

-  Merge: 

-  Old  Analytical  Data 

-  Updated  Discrete  Data 

Once  a  new  mesh  has  been  generated,  the  solution  from  the  previous  timestep  (on 
the  previous  mesh)  has  to  be  interpolated.  Optimal  interpolation  algorithms  for  un¬ 
structured  grids  were  developed  in  a  previous  AFOSR-sponsored  effort.  Whenever  new 
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fluid  domains  are  created  due  to  failure,  cracking  and  spalation,  interpolating  the  fluid 
solution  from  the  previous  timestep  to  these  new  domains  will  end  in  failure,  as  there 
are  no  possible  host  elements  in  the  old  mesh.  It  is  therefore  important  to  identify 
points  of  the  new  mesh  that  lie  outside  the  confines  of  the  old  mesh.  A  new  way  that 
was  developed  and  that  has  proven  successful  is  to  form  a  Cartesian  mesh  or  bins.  A 
loop  is  then  performed  over  the  elements  of  the  old  mesh,  marking  the  bins  covered  by 
elements.  In  a  second  loop  over  the  points  of  the  new  grid,  all  points  that  fall  into  bins 
not  covered  by  the  old  grid  are  marked  as  impossible  to  interpolate.  This  procedure 
can  be  done  recursively  by  obtaining  the  confines  of  the  volume  where  points  have  been 
marked  as  impossible,  leading  to  so-called  ‘telescoping’  of  the  bin  search  region.  No 
attempt  is  then  made  to  interpolate  the  points  marked  as  outside  the  old  mesh.  The 
unknowns  for  these  points  can  be  extrapolated  using  different  procedures: 

-  Advancing  layers  (most  often  used  for  subsonic/isotropic  flows); 

-  Upstream  (used  primarily  for  supersonic  flows); 

-  Closest  known  point  (for  cracks);  or 

-  Via  user- prescribed  subroutine  (for  special  cases). 

For  technical  details  the  reader  is  referred  to  [19],  which  is  reproduced  in  Appendix  5. 

4.  EFFICIENT  USE  OF  SUPERCOMPUTING  HARDWARE 

During  the  present  effort  we  parallelized  the  majority  of  the  ‘core’  subroutines  required 
by  the  flow  solver  for  shared  memory,  parallel  machines.  In  order  to  appreciate  the 
task  at  hand,  consider  the  following  Laplacian  right-hand-side  loop  in  its  original  form: 
do  1600  iedge=l ,nedge 
ipoi l=lnoed ( 1 , iedge) 
ipoi2=lnoed(2, iedge) 

redge=geoed (  iedge) * (unkno (ipoi2) -unkno (ipoil) ) 
rhspo ( ipoi 1 ) =rhspo ( ipo i 1 ) +redge 
rhspo(ipoi2)=rhspo(ipoi2)-redge 
1600  continue 

Here  nedge,lnoed,geoed, unkno, rhspo  denote,  respectively,  the  number  of  edges, 
edge-point  connectivity,  geometrical  and  physical  edge-parameters,  unknowns  at  points 
and  the  right-hand-side  vector  at  points.  In  order  to  run  well  on  a  shared  memory, 
cache-and  RISC-based  (i.e.  pipelined)  parallel  machine,  it  has  to  be  re-written  as: 
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do  1000  imacg=l,npasg,nproc 
imac0=  imacg 

imacl=min(npasg,imacO+nproc-l) 
c$doacross  local (ipasg,ipass,npasO,npasl, iedge, nedgO,nedgl , 

ipoil,ipoi2,redge)  !  Parallelization  directive 

do  1200  ipasg=imacO , imacl 
npasO=edpag(ipasg)+l 
npasl=edpag(ipasg+l) 
do  1400  ipass=npas0,npasl 
nedg0=edpas (ipass) +1 
nedgl=edpas(ipass+l) 

c$dir  ivdep  !  Pipelining  directive 

do  1600  iedge=nedg0,nedgl 
ipoil=lnoed(l, iedge) 
ipoi2=lnoed(2 , iedge) 

redge=geoed(  iedge) * (unkno (ipoi2) -unkno (ipoil) ) 
rhspo ( ipo i 1 ) =rhspo (ipoil) +redge 
rhspo (ipoi2) =rhspo (ipoi2) -redge 
1600  continue 
1400  continue 
1200  continue 
1000  continue 

where  nproc  denotes  the  nr.  of  processors.  Approximately  100  subroutines  were  re¬ 
written  during  this  period  of  time.  For  technical  details,  see  [24,25,26].  Figure  1  shows 
the  speedups  obtained  for  a  sueprsonic  inlet  flow  on  a  number  of  platforms.  Scalability 
is  evident. 


1  2  4  8  16  32 

Nr.  of  Processors 


Euler,  500Ktet,  RK3,  Roe+MUSCL 
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5.  POST-PROCESSING 


In  order  to  handle  the  very  large  amounts  of  output  data  produced  by  current  CFD 
runs  with  moving  bodies,  a  new  post-processing  tool,  called  ZFEM,  was  developed. 
The  following  design  requirements  were  set  out  for  the  new  visualization  system: 

-  Continuous  (Grid-Based)  and  Discrete  (Particle-Based)  Data 

-  Distributed/Remote  Reading 

-  Distributed/Remote  Display 

-  Collaborative/Linked  Display 

-  Multidisciplinary 

-  On-Line  Display  With  Loose  Coupling 

-  Interactive 

-  Repeatable  (Movie-Making) 

-  Steering 

-  Portable 

-  User  Friendly 

-  Minimum  Storage 

-  Maximum  Speed 

A  new  visualization  system  was  devised,  whose  core  ideas  are  summarized  below: 

-  Master /Worker /Viewer/ Signaler  Task  Assignment 

-  Link  Workers/Viewers  on  Different  Hosts 

-  Interaction  With  Solver  via  Signal  Files 

-  Script  System  (Movie-Making) 

-  Standard  Libraries:  OpenGL,  Motif,  PVM 

-  C  Language 

-  Optimized  Data  Structures  for  Display 

-  GUI  +  Help  System  (HTML) 

Two  novel  aspects,  that  set  this  new  visualizer  apart  from  anything  hitherto  developed, 
are: 


-  The  parallel  visualization  of  distributed  data; 

-  The  concurrent,  collaborative  visualization  of  data  across  the  internet.  This  last 
aspect  has  made  collaborative  assignments  across  geographically  distant  locations 
a  reality; 

-  The  ability  to  connect,  remotely,  to  a  running  flow  solver  for  on-line  visualization 
without  disturbing  its  performance. 

This  new  visualization  system  is  still  evolving.  For  more  information  on  its  current  sta¬ 
tus,  see  the  web-page:  http://www.science.gmu.edu/~jcebral/pages/zfem.html. 
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6.  SUMMARY 


The  present  research  effort  significantly  advanced  the  state  of  the  art  in  the  simulation 
of  compressible  viscous  flows  with  moving  bodies.  A  number  of  breakthroughs  and 
‘firsts’  were  achieved,  of  which  the  following  are  considered  the  most  important: 

-  The  first  fast  matrix-free  implicit  scheme  for  unstructured  grids  (GMRES-LU-SGS) 

[2]; 

-  The  first  scalable,  shared-memory,  unstructured-grid  flow  solver  [25]; 

-  The  first  conservative  load  transfer  algorithm  for  fluid-structure  interaction  simu¬ 
lations  [13]; 

-  The  first  simulation  of  compressible  flows  with  more  than  a  thousand  indepen¬ 
dently  moving  bodies  [19]; 

-  The  most  complex  fluid-structure  interaction  simulation  to  date  [16];  and 

-  The  first  fluid-structure  interaction  simulation  capability  for  accurate  fragmenta¬ 
tion  and  cracking  [19]. 

More  work  is  still  required  to  transform  these  algorithms  into  daily  production  tools 
that  can  be  used  effectively  in  the  design  and  engineering  process. 
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A  fast,  matrix-free  implicit  method  has  been  developed  to  solve  the  three-dimen¬ 
sional  compressible  Euler  and  Navier-Stokes  equations  on  unstructured  meshes. 
An  approximate  system  of  linear  equations  arising  from  the  Newton  linearization 
is  solved  by  the  GMRES  (generalized  minimum  residual)  algorithm  with  a  LU- 
•  SGS  (lower-upper  symmetric  Gauss-Seidel)  preconditioner.  A  remarkable  feature 
of  the  present  GMRES+LU-SGS  method  is  that  the  storage  of  the  Jacobian  matrix 
can  be  completely  eliminated  by  approximating  the  Jacobian  with  numerical  fluxes, 
resulting  in  a  matrix-free  implicit  method.  The  method  developed  has  been  used  to 
compute  the  compressible  flows  around  3D  complex  aerodynamic  configurations  for 
a  wide  range  of  flow  conditions,  from  subsonic  to  supersonic.  The  numerical  results 
obtained  indicate  that  the  use  of  the  GMRES+LU-SGS  method  leads  to  a  significant 
increase  in  performance  over  the  best  current  implicit  methods,  GMRES+ILU  and 
LU-SGS,  while  maintaining  memory  requirements  similar  to  its  explicit  counterpart. 
An  overall  speedup  factor  from  eight  to  more  than  one  order  of  magnitude  for  all  test 
cases  in  comparison  with  the  explicit  method  is  demonstrated.  ©  1998  Academic  Press 


1.  INTRODUCTION 

The  use  of  unstructured  meshes  for  computational  fluid  dynamics  problems  has  become 
widespread  due  to  their  ability  to  discretize  arbitrarily  complex  geometries  and  due  to 
the  ease  of  adaptation  in  enhancing  the  solution  accuracy  and  efficiency  through  the  use 
of  adaptive  refinement  techniques.  In  recent  years,  significant  progress  has  been  made  in 
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developing  numerical  algorithms  for  the  solution  of  the  compressible  Euler  and  Navier- 
Stokes  equations  on  unstructured  grids. 

Early  efforts  in  the  development  of  temporal  discretization  methods  using  unstructured 
grids  focused  on  explicit  schemes.  Usually,  explicit  temporal  discretizations  such  as  mul¬ 
tistage  Runge-Kutta  schemes  are  used  to  drive  the  solution  to  steady  state.  Acceleration 
techniques  such  as  local  time-stepping  and  implicit  residual  smoothing  have  also  been  com¬ 
bined  in  this  context.  In  general,  explicit  schemes  and  their  boundary  conditions  are  easy  to 
implement,  vectorize  and  parallelize,  and  require  only  limited  memory  storage.  However, 
for  large-scale  problems  and  especially  for  the  solution  of  the  Navier-Stokes  equations,  the 
rate  of  convergence  slows  down  dramatically,  resulting  in  inefficient  solution  techniques. 
In  order  to  speed  up  convergence,  a  multigrid  strategy  or  an  implicit  temporal  discretization 
is  required. 

In  general,  implicit  methods  require  the  solution  of  a  linear  system  of  equations  arising 
from  the  linearization  of  a  fully  implicit  scheme  at  each  time  step  or  iteration.  The  most 
widely  used  methods  to  solve  a  linear  system  on  unstructured  grids  are  iterative  solution 
methods  and  approximate  factorization  methods.  Significant  efforts  have  been  made  to 
develop  efficient  iterative  solution  methods.  These  range  from  Gauss-Seidel  to  Krylov  sub¬ 
space  methods  that  use  a  wide  variety  of  preconditioners  (see,  e.g.,  Stoufflet  [1],  Batina  [2], 
Venkatakrishnan  etal.  [3],  Knight  [4],  Whitaker  [5],  Luo  etal  [6],  and  Barth  et al.  [7]).  The 
most  successful  and  effective  iterative  method  is  to  use  the  Krylov  subspace  methods  [8, 9] 
such  as  GMRES  and  BICGSTAB  with  an  ILU  (incomplete  lower-upper)  factorization 
preconditioner.  The  drawback  is  that  they  require  a  considerablece  amount  of  memory 
to  store  the  Jacobian  matrix,  which  may  be  prohibitive  for  large  problems.  Recently,  the 
lower-upper  symmetric  Gauss-Seidel  method  developed  first  by  Jameson  and  Yoon  [10] 
on  structured  grids  has  been  successfully  generalized  and  extended  to  unstructured  meshes 
by  several  authors  [11-13].  The  most  attractive  feature  of  this  approximate  factorization 
method  is  that  the  evaluation  and  storage  of  the  Jacobian  matrix  inherent  in  the  original 
formulation  of  the  LU-SGS  method  can  be  completely  eliminated  by  making  some  approx¬ 
imations  to  the  implicit  operator.  The  resulting  LU-SGS  method  can  be  made  even  cheaper 
than  the  explicit  method  per  time  step.  However,  this  method  is  less  effective  than  the  most 
efficient  iterative  methods  such  as  GMRES+ILU,  because  of  slow  convergence,  requiring 
thousands  of  time  steps  to  achieve  a  steady  state. 

The  objective  of  the  effort  discussed  in  this  paper  is  to  develop  a  fast  implicit  method  for 
solving  compressible  flow  problems  around  3D  complex,  realistic  aerodynamic  configura¬ 
tions  on  unstructured  grids.  Typically,  hundreds  of  thousands  of  mesh  points  are  necessary 
to  represent  such  engineering-type  configurations  accurately.  Any  implicit  methods  requir¬ 
ing  the  storage  of  the  Jacobian  matrix  would  be  impractical,  if  not  impossible  to  use  to  solve 
such  large-scale  problems,  where  the  storage  requirement  can  easily  exceed  the  memory 
limitation  of  present  computers.  In  the  present  work  a  system  of  linear  equations,  aris¬ 
ing  from  an  approximate  linearization  of  a  fully  implicit  temporal  discretization  at  each 
time  step,  is  solved  iteratively  by  a  GMRES  algorithm  with  an  LU-SGS  preconditioner. 
The  idea  behind  this  is  to  combine  the  efficiency  of  the  iterative  methods  and  low  mem¬ 
ory  requirement  of  approximate  factorization  methods  in  an  effort  to  develop  a  fast,  low 
storage  implicit  method.  An  apparent  advantage  of  the  LU-SGS  preconditioner  is  that  it 
uses  the  Jacobian  matrix  of  the  linearized  scheme  as  a  preconditioner  matrix,  as  compared 
with  ILU  preconditioner  and,  consequently,  does  not  require  any  additional  memory  stor¬ 
age  and  computational  effort  to  store  and  compute  the  preconditioner  matrix.  Furthermore, 
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the  storage  of  the  approximate  Jacobian  matrix  can  be  completely  eliminated  by  approxi¬ 
mating  the  Jacobian  with  numerical  fluxes,  which  will  lead  to  a  fast,  low-storage  implicit 
algorithm.  The  matrix-free  implicit  method  developed  has  been  used  to  compute  a  wide 
range  of  test  problems  and  has  been  compared  with  a  well-known  GMRES+ILU  algorithm 
and  an  approximately  factored  implicit  algorithm  LU-SGS.  The  new  algorithm  is  found  to 
offer  substantial  CPU  time  savings  over  the  best  current  implicit  methods,  while  maintaining 
a  memory  requirement  competitive  with  the  explicit  method. 


2.  GOVERNING  EQUATIONS 

The  Navier-Stokes  equations  governing  unsteady  compressible  viscous  flows  can  be 
expressed  in  the  conservative  form  as 


au  8F>  _  dGj 

dt  +  dXj  dXj  ’ 


(2.1) 


where  the  summation  convention  has  been  employed.  The  unknown  vector  U,  inviscid  flux 
vector  F,  and  viscous  flux  vector  G  are  defined  by 


/  p  \ 

(  pitj  \ 

(  0  \ 

U  =  I  pUi  1 

,  F''  = 

pitiUj  +  pSjj 

.  Gj  = 

(Tij 

\pe  ) 

uj(pe  +  p)  ) 

\ui°U+kE- ) 

Here  p,  p,e,T,  and  k  denote  the  density,  pressure,  specific  total  energy,  temperature,  and 
thermal  conductivity  of  the  fluid,  respectively,  and  w,  is  the  velocity  of  the  flow  in  the 
coordinate  direction  This  set  of  equations  is  completed  by  the  addition  of  the  equation 
of  state, 


P  =  (Y  ~  1  )P 


1 

2  UJUJ 


(2.3) 


which  is  valid  for  perfect  gas,  where  y  is  the  ratio  of  the  specific  heats,  and  Cv  is  the  specific 
heat  at  constant  volume.  The  components  of  the  viscous  stress  tensor  oty  are  given  by 


°ij  =  M 


/  dUi_ 
\dXj 


+  X 


duk  z 
- On . 

dxk 


(2.4) 


The  thermal  conductivity  k  and  viscosity  coefficient  (i  are  assumed  to  be  a  function  of  the 
temperature  and  are  determined  using  Sutherland’s  empirical  relation.  It  is  assumed  that  X 
and  \i  are  related  by  Stokes’  hypothesis 


2  p 

T’ 


(2.5) 


The  left-hand  side  of  Eq.  (2.1)  constitutes  the  Euler  equations  governing  unsteady  com¬ 
pressible  inviscid  flows. 

In  the  sequel,  we  assume  that  Q  is  the  flow  domain,  T  is  its  boundary,  and  tij  is  the  unit 
outward  normal  vector  to  the  boundary. 
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3.  HYBRID  DISCRETE  FORMULATION 

Assuming  is  a  classical  triangulation  of  £2,  N,  is  a  standard  linear  finite  element 
shape  function  associated  with  a  node  /,  and  Cr  is  a  dual  mesh  cell  associated  with  the 
node,  the  hybrid  finite  volume  and  finite  element  formulation  used  for  discretization  of  the 
Navier-Stokes  equations  is 

find  UA  e  Th  such  that  for  each  V/  (1  <  /  <  n) 

FW.Mr=/^!w[!l  <31) 

Jc,  dt  J ac,  Jah  dxj 

where  Th  is  a  discrete  approximation  space  of  suitable  continuous  functions.  Inviscid  fluxes 
are  discretized  using  a  cell-vertex  finite  volume  formulation,  where  the  control  volumes  are 
nonoverlapping  dual  cells  constructed  by  the  median  planes  of  the  tetrahedra.  In  the  present 
study  the  numerical  flux  functions  for  inviscid  fluxes  at  the  dual  mesh  cell  interface  are  com¬ 
puted  using  the  AUSM+  [14]  (advection  upwind  splitting  method)  scheme.  A  MUSCL  [15] 
approach  is  used  to  achieve  high-order  accuracy.  The  Van  Albada  limiter  based  on  primitive 
variables  is  used  to  suppress  the  spurious  oscillation  in  the  vicinity  of  the  discontinuities. 
The  implementation  of  the  precise  MUSCL  strategy  used  in  the  present  work  can  be  found 
in  Ref.  [21].  Viscous  flux  terms  are  evaluated  using  a  linear  finite  element  approximation, 
which  is  equivalent  to  a  second-order  accurate  central  difference. 

Equation  (3.1)  can  then  be  rewritten  in  a  semi-discrete  form  as 


Vi 


3U, 

dt 


R, 


(3.2) 


where  V)  is  the  volume  of  the  dual  mesh  cell  (equivalent  to  the  lumped  mass  matrix  in  the 
finite  element),  and  R  is  the  right-hand  side  residual  and  equals  zero  for  a  steady-state 
solution. 


4.  IMPLICIT  TIME  INTEGRATION 

In  order  to  obtain  a  steady-state  solution,  the  spatially  discretized  Navier-Stokes  equa¬ 
tions  must  be  integrated  in  time.  Using  Euler  implicit  time-integration,  Eq.  (3.2)  can  be 
written  in  discrete  form  as 


*  At  ’ 


(4.1) 


where  At  is  the  time  increment  and  AUrt  is  the  difference  of  an  unknown  vector  between 
time  levels  n  and  n  +  1;  i.e., 


ait  =  u"+1  -u\ 


(4.2) 


Equation  (4.1)  can  be  linearized  in  time  as 
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where  R,  is  the  right-hand  side  residual  and  equals  zero  for  a  steady-state  solution.  Writing 
the  equation  for  all  nodes  leads  to  the  delta  form  of  the  backward  Euler  scheme 


where 


A  AU  =  R, 


A  = 


V  l  3Rn 
Ar1  ~  W 


(4.4) 


(4.5) 


Note  that  as  At  tends  to  infinity,  the  scheme  reduces  to  the  standard  Newton’s  method  for 
solving  a  system  of  nonlinear  equations.  Newton’s  method  is  known  to  have  a  quadratic 
convergence  property.  The  term  3R/3U  represents  symbolically  the  Jacobian  matrix.  It 
involves  the  linearization  of  both  inviscid  and  viscous  flux  vectors.  In  order  to  obtain  the 
quadratic  convergence  of  Newton’s  method,  the  linearization  of  the  numerical  flux  function 
must  be  virtually  exact.  Unfortunately,  explicit  formation  of  the  Jacobian  matrix  resulting 
from  the  exact  linearization  of  any  second-order  numerical  flux  functions  for  inviscid  fluxes 
can  require  excessive  storage  and  is  extremely  expensive,  if  not  impossible  to  evaluate.  In 
order  to  reduce  the  number  of  nonzero  entries  in  the  matrix  and  to  simplify  the  linearization, 
only  a  first-order  representation  of  the  numerical  fluxes  is  linearized.  This  results  in  the  graph 
of  the  sparse  matrix  3R/3U  being  identical  to  the  graph  of  the  supporting  unstructured 
mesh.  In  addition,  the  following  simplified  flux  function  is  used  to  obtain  the  left-hand  side 
Jacobian  matrix, 


Ri(Ui,U  ;,n,7) 


^(F(U,-,  ny)  +  F(U„  n,y))  -  ||X/V|(Uy  -  U,). 


(4.6) 


where 


I*/; I  =  IVy  •  nu\  +  Cu  +  — ^ (4.7) 
Pij\Xj  -*/| 

where  n,y  is  the  unit  vector  normal  to  the  cell  interface,  V,-y  is  the  velocity  vector,  and  C/; 
is  the  speed  of  sound.  Note  that  this  flux  function  is  derived  by  replacing  the  Roe’s  matrix 
by  its  spectral  radius  in  the  well-known  Roe’s  Flux  function  [16], 

R/nv  ^  ^(F(Uf,  n,7)  4-  F(U„  Ujj))  -  -  Uf)  (4.8) 

for  the  inviscid  flux  vector,  and  the  viscous  Jacobian  matrix  is  simply  approximated  by  its 
spectral  radius  in  the  above  linearization  process.  The  linearization  of  flux  function  (4.6) 
yields 


3R;  1 

-  Ml). 


(4.9) 

(4.10) 


where  J  =  3F/3U  represents  the  Jacobian  of  the  inviscid  flux  vector.  The  penalty  for  mak¬ 
ing  these  approximations  in  the  linearization  process  is  that  the  quadratic  convergence  of 
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Newton’s  method  can  no  longer  be  achieved  because  of  the  mismatch  and  inconsistency  be¬ 
tween  the  right-  and  left-hand  sides  in  Eq.  (4.4).  Although  the  number  of  time  steps  (Newton 
iterations,  if  A t  tends  to  infinity)  may  increase,  the  cost  per  each  time  step  is  significantly 
reduced:  it  takes  less  CPU  time  to  compute  the  Jacobian  matrix  and  the  conditioning  of 
the  simplified  Jacobian  matrix  is  improved,  thus  reducing  computational  cost  to  solve  the 
resulting  linear  system. 

As  only  a  first-order  representation  of  the  numerical  fluxes  is  considered,  the  number  of 
nonzero  entries  in  each  row  of  the  matrix  is  related  to  the  number  of  edges  incident  to  the 
node  associated  with  that  row.  In  other  words,  each  edge  i  j  will  guarantee  nonzero  entries  in 
the  fth  column  and  y'th  row  and,  similarly,  the  yth  column  and  ith  row.  In  addition,  nonzero 
entries  will  be  placed  on  the  diagonal  of  the  matrix.  Using  an  edge-based  data  structure,  the 
left-hand  side  Jacobian  matrix  is  stored  in  upper,  lower,  and  diagonal  forms,  which  can  be 
expressed  as 


U  —  ntj)  —  IA.//ID,  (4.11) 

L  =  i(-J(Ul-,njy)-|Xy|D,  (4.12) 

D  =  ^"1  +  yi  — (7(U/,  n ij)  +  |A,^|I).  (4.13) 

j 

Note  that  U,  L,  and  D  represent  the  strict  upper  matrix,  the  strict  lower  matrix,  and 
the  diagonal  matrix,  respectively.  Both  upper  and  lower  matrices  require  a  storage  of 
nedge  x  neqns  x  neqns  and  the  diagonal  matrix  needs  a  storage  of  npoin  x  neqns  x  neqns , 
where  npoin  is  the  number  of  grid  points;  neqns  (=5  in  3D)  is  the  number  of  unknown 
variables  and  nedge  is  the  number  of  edges.  Note  that  in  3D  nedge  npoin.  Clearly,  the 
upper  and  lower  matrix  consume  substantial  amounts  of  memory,  taking  93%  of  the  storage 
required  for  left-hand  side  Jacobian  matrix. 

Equation  (4.4)  represents  a  system  of  linear  simultaneous  algebraic  equations  and  needs 
to  be  solved  at  each  time  step.  The  most  widely  used  methods  to  solve  this  linear  sys¬ 
tem  are  iterative  solution  methods  and  approximate  factorization  methods.  Recently,  the 
lower-upper  symmetric  Gauss-Seidel  method  developed  first  by  Jameson  and  Yoon  [10] 
on  structured  grids  has  been  successfully  generalized  and  extended  to  unstructured  meshes 
by  several  authors  [11-13].  The  LU-SGS  method  is  attractive  because  of  its  good  stability 
properties  and  competitive  computational  cost  in  comparison  to  explicit  methods.  In  this 
method,  the  matrix  A  is  split  in  three  matrices,  a  strict  lower  matrix  L,  a  diagonal  matrix 
Z),  and  a  strict  upper  matrix  U.  This  system  is  approximately  factored  by  neglecting  the 
last  term  on  the  right-hand  side  of  Eq.  (4.14).  The  resulting  equation  can  be  solved  in  the 
two  steps  shown  in  Eqs.  (4. 15)  and  (4. 16),  each  of  them  involving  only  simple  block  matrix 
inversions: 


(D  +  L)D~\D  +  U)  AU  =  R  +  (LD~lU)  AU.  (4.14) 

Lower  (forward)  sweep: 


(D  +  L)AU*  =  R. 


(4.15) 
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Upper  (backward)  sweep: 

(D  +  (/)AU  =  DAU*.  (4.16) 

Both  lower  and  upper  sweeps  can  be  vectorized  by  appropriately  reordering  the  grid 
points  [13],  resulting  in  a  very  efficient  algorithm.  It  is  found  that  the  CPU  cost  of  one 
LU-SGS  step  is  approximately  50%  of  one  three-stage  Runge-Kutta  explicit  step. 

It  is  clear  that  the  above  algorithm  involves  primarily  the  Jacobian  matrix-solution  in¬ 
cremental  vector  product.  Such  operation  can  be  approximately  replaced  by  computing 
increments  of  the  flux  vector  AF: 

JAV&  AF  =  F(U  +  AU)-F(U).  (4.17) 

This  idea  of  the  matrix-free  approach,  in  which  the  product  of  Jacobian  matrix  and  incre¬ 
mental  vector  is  approximated  by  the  increment  of  the  flux  vector,  was  first  introduced  in 
the  work  of  Sharov  and  Nakahashi  [13].  The  forward  sweep  and  backward  sweep  steps  can 
then  be  expressed  as 

AU*  =  ZT1  R,  -  J2  i(AF*  -  |A,7|AU*).v,7  ,  (4.18) 

j'J<i 

AU,  =  AU*  -  D~l  ]T  ^(AF,  -  |A.,7|AU;).v,7.  (4.19) 

j:j>i 

The  most  remarkable  achievement  of  this  approximation  is  that  there  is  no  need  to  store  the 
upper  and  lower  matrics  U  and  L,  which  substantially  reduces  the  memory  requirements. 
It  is  found  that  this  approximation  does  not  compromise  any  numerical  accuracy,  and  the 
extra  computational  cost  is  negligible. 

Although  the  LU-SGS  method  is  more  efficient  than  its  explicit  counterpart,  a  significant 
number  of  time  steps  are  still  required  to  achieve  the  steady-state  solution,  due  to  the  nature 
of  the  approximation  factorization  schemes.  One  way  to  speed  up  the  convergence  is  to  use 
iterative  methods.  In  this  work,  the  system  of  linear  equations  is  solved  by  the  generalized 
minimal  residual  (GMRES)  method  of  Saad  and  Schultz  [8],  This  is  a  generalization  of 
the  conjugate  gradient  method  for  solving  a  linear  system  where  the  coefficient  matrix 
is  not  symmetric  and/or  positive  definite.  The  use  of  GMRES  combined  with  different 
preconditioning  techniques  is  becoming  widespread  in  the  CFD  community  for  the  solution 
of  the  Euler  and  Navier-Stokes  equations  [3,  5-7],  GMRES  minimizes  the  norm  of  the 
computed  residual  vector  over  the  subspace  spanned  by  a  certain  number  of  orthogonal 
search  directions.  It  is  well  known  that  the  speed  of  convergence  of  an  iterative  algorithm 
for  a  linear  system  depends  on  the  condition  number  of  the  matrix  A.  GMRES  works  best 
when  the  eigenvalues  of  matrix  A  are  clustered.  The  easiest  and  the  most  common  way  to 
improve  the  efficiency  and  robustness  of  GMRES  is  to  use  preconditioning  to  attempt  to 
cluster  the  eigenvalues  at  a  single  value.  The  preconditioning  technique  involves  solving  an 
equivalent  preconditioned  linear  system, 

AAU  =  R,  (4.20) 

instead  of  the  original  system  (4.4),  in  the  hope  that  A  is  well  conditioned.  Left  precondi¬ 
tioning  involves  premultiplying  the  linear  system  with  a  matrix  as 

P-'AAU  =  P~'  R, 


(4.21) 
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where  P  is  the  preconditioning  matrix.  The  best  preconditioning  matrix  for  A  would  cluster 
as  many  eigenvalues  as  possible  at  unity.  Obviously,  the  optimal  choice  of  P  is  A,  in  which 
case  the  underlying  matrix  problem  for  GMRES  is  trivially  solved  with  one  Krylov  vector. 
The  motivation  for  preconditioning  is  twofold:  (a)  reduce  the  computational  effort  required 
to  solve  the  linearized  system  of  equations  at  each  time-step;  (b)  reduce  the  total  number  of 
time  steps  required  to  obtain  a  steady  state  solution.  Preconditioning  will  be  cost-effective 
only  if  the  additional  computational  work  incurred  for  each  subiteration  is  compensated  for 
by  a  reduction  in  the  total  number  of  iterations  to  convergence.  In  this  way,  the  total  cost  of 
solving  the  overall  nonlinear  system  is  reduced.  In  the  present  work,  the  LU-SGS  presented 
above  is  used  as  a  preconditioner,  i.e., 

P  =  (D  +  L)D~'(D  +  U).  (4.22) 

A  clear  advantage  of  the  LU-SGS  preconditioner  is  that  it  uses  the  Jacobian  matrix  of 
the  linearized  scheme  as  a  preconditioner  matrix,  as  compared  with  ILU  preconditioner. 
Consequently  it  does  not  require  any  additional  memory  storage  and  computational  effort 
to  store  and  compute  the  preconditioner  matrix.  The  preconditioned  restarted  GMRES 
algorithm  is  described  below. 

Algorithm.  Restarted  preconditioned  GMRES(&). 

For  l  =  1 ,  m  do  m  restart  iterations 

v0  =  R  —  A  AUo  initial  residual 

r0  :=  P~l\o  preconditioning  step 

:=  II ro II2  initial  residual  norm 

Vi  :—ro/p  define  initial  Krylov 

for  j  =  1,  k  do  inner  iterations 

y j  :=  A\j  matrix-vector  product 

w j  :=  P~lyj  preconditioning  step 

For  i  =  1 ,  j  do  Gram-Schmidt  step 

Wj  =  W J  -  hijVi 

EndDo 

hj+U  :=  II wy  II 2 

V;+1  :=  w j! h j+ij  define  Krylov  vector 
EndDo 

z  :=  min^  ||/3ei  —  H z||2  least  squares  solve 
AU  :=  AU0  +  Yj7=  1  yizi  approximate  solution 
if  ||/tej  -  Hz\\2  <  e  exit  convergence  check 
AUo  •=  AU  restart 

EndDo 

Note  that  the  above  GMRES  algorithm  only  requires  matrix-vector  products,  the  same 
technique  used  in  the  LU-SGS  method  can  be  applied  to  eliminate  the  storage  of  the  upper 
and  lower  matrices. 

The  present  GMRES+LU-SGS  method  only  requires  the  storage  of  the  diagonal  matrix.  In 
addition,  a  storage  corresponding  to  2  x  nedge  is  required  for  the  two  index  arrays,  which 
are  necessary  to  achieve  the  vectorization  of  LU-SGS  method.  The  need  for  additional 
storage  associated  with  the  GMRES  algorithm  is  an  array  of  size  (k  +  2)  x  neqns  x  npoin , 
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where  k  is  the  number  of  search  directions.  Since  the  GMRES+LU-SGS  is  completely 
separated  from  the  flux  computation  procedure,  memory,  which  is  used  to  compute  fluxes 
can  be  used  by  the  GMRES+LU-SGS.  Overall,  the  extra  storage  of  the  GMRES+LU-SGS 
method  is  approximately  10%  of  the  total  memory  requirements. 

5.  NUMERICAL  RESULTS 

The  present  implicit  method  has  been  used  to  compute  a  variety  of  compressible  flow 
problems  for  a  wide  range  of  flow  conditions,  from  subsonic  to  supersonic,  for  both  inviscid 
and  viscous  flows,  in  both  2D  and  3D.  Only  a  few  typical  examples  in  3D  are  presented 
here  to  demonstrate  the  effectiveness  and  robustness  of  the  present  implicit  method  over  the 


FIG.  1.  (a)  Surface  mesh  used  for  computing  channel  flow  (nclem  =  68,097,  npoin  =  1 3,09 1 ,  nboun  =  4,442). 
(b)  Computed  pressure  contours  on  the  channel  surface  at  M,„  —0.5.  (c)  Computed  Mach  number  contours  on  the 
channel  surface  at  Min  —  0.5. 
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Time  steps 


FIG.  1.  (d)  Convergence  history  versus  time  steps  for  subsonic  channel  flow  using  different  schemes:  explicit, 
GMRES+ILU,  LU-SGS,  GMRES+LU-SGS,  matrix-free  LU-SGS,  and  matrix-free  GMRES+LU-SGS.  (e)  Con¬ 
vergence  history  versus  CPU  time  for  subsonic  channel  flow  using  different  schemes:  explicit,  GMRES+ILU, 
LU-SGS,  GMRES+LU-SGS,  matrix-free  LU-SGS,  and  matrix-free  GMRES+LU-SGS. 


existing  implicit  methods.  No  attempt  has  been  made  to  use  our  GMRES+ILU  method  [6] 
to  solve  the  Navier-Stokes  equations  and  large-scale  problems,  as  its  storage  requirement 
exceeds  the  memory  limitation  of  the  computer  available  to  us. 

All  of  the  grids  used  here  were  generated  by  the  advancing  front  technique  [17].  All 
computations  were  started  with  uniform  flow.  The  relative  L2  norm  of  the  density  residual 
is  taken  as  a  criterion  to  test  convergence  history.  The  solution  tolerance  for  GMRES  is 
set  to  0.1  with  10  search  directions  and  20  iterations.  We  observed  that  during  the  first  few 
time  steps,  more  iterations  are  spent  to  solve  the  system  of  the  linear  equations:  even  20 
iterations  cannot  guarantee  that  the  stopping  criterion  will  be  satisfied  for  some  problems. 
However,  it  only  takes  four  or  five  iterations  to  solve  the  linear  equations  at  a  later  time, 
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FIG.  1.  (f)  Convergence  history  of  linear  system  at  different  time-steps  for  subsonic  flow  by  matrix-free 

GMRES+LU-SGS  method,  (g)  Effect  of  CFL  number  on  convergence  history  by  matrix-free  LU-SGS  method 
for  subsonic  flow. 


and  global  convergence  is  not  affected  by  a  lack  of  linear  system  convergence  during  the 
first  few  time  steps. 

A.  Inviscid  Subsonic  Flow  in  a  Channel  with  a  Circular  Bump  on  the  Lower  Wall 

The  first  example  is  the  well-known  Ni’s  test  case:  a  subsonic  flow  in  a  channel  with  a 
10%  thick  circular  bump  on  the  bottom.  The  length  of  the  channel  is  3,  its  height  is  1 ,  and 
its  width  is  0.5.  The  inlet  Mach  number  is  0.5.  This  is  a  three-dimensional  simulation  of 
a  two-dimensional  flow.  Since  no  shock  waves  are  present  in  the  flow  fields,  all  solutions 
were  obtained  using  a  second-order  scheme  without  any  limiters.  The  mesh,  which  contains 
13,891  grid  points,  68,097  elements,  and  4442  boundary  points  is  depicted  in  Fig.  la.  The 
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FIG.  1.  (h)  Convergence  history  of  the  residual  for  coarse  and  fine  meshes  for  subsonic  flow  by  explicit  me¬ 
thod.  (i)  Convergence  history  of  the  residual  for  coarse  and  fine  meshes  for  subsonic  flow  by  GMRES+LU-SGS 
implicit  method. 


computed  Mach  number  and  pressure  contours  in  the  flow  field  are  shown  in  Figs,  lb  and 
lc,  respectively.  Figures  Id  and  le  display  a  comparison  of  convergence  histories  among  the 
explicit  scheme,  the  GMRES+ILU  scheme,  the  LU-SGS  scheme,  the  GMRES+LU-SGS 
scheme,  the  matrix-free  LU-SGS  scheme,  and  the  matrix-free  GMRES+LU-SGS  scheme 
versus  time  steps  and  CPU  time,  respectively.  The  explicit  method  used  a  three-stage  Runge- 
Kutta  time-stepping  scheme  with  local  time  stepping  and  implicit  residual  smoothing.  The 
computation  was  advanced  with  a  CFL  number  of  4.  A  CFL  number  of  10,000  was  used 
by  all  implicit  methods  in  the  computation.  It  is  clear  that  GMRES+LU-SGS  methods 
are  superior  to  both  GMRES+ILU  and  LU-SGS  methods.  The  present  GMRES+LU-SGS 
method  is  over  100  times  faster  than  its  explicit  counterpart  for  this  particular  case.  This 
is  due  to  the  fact  that  the  convergence  of  the  explicit  method  deteriorates  dramatically  for 
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a  low-speed  flow  problem.  From  Fig.  le  we  observe  that  both  the  matrix-free  LU-SGS 
scheme  and  the  matrix-free  GMRES+LU-SGS  scheme  yield  a  convergence  history  that 
is  identical  to  their  respective  matrix  counterparts.  This  indicates  that  both  matrix-free 
schemes  yield  solutions  that  are  identical  to  their  matrix  counterparts.  However,  the  matrix- 
free  GMRES+LU-SGS  scheme  is  slightly  more  expensive  than  its  matrix  counterpart,  as  we 
can  see  from  Fig.  1  f.  This  is  due  the  fact  that  for  each  GMRES  iteration,  the  former  involves 
the  numerical  flux  computation,  while  the  latter  involves  the  computation  of  a  matrix  vector 
product,  which  is  apparently  cheaper  to  calculate.  It  is  worth  noting  that  per  time  step  the 
present  LU-SGS  method  costs  approximately  half  of  a  three-stage  Runge-Kutta  explicit 
method  with  a  residual  smoothing.  However,  the  cost  of  the  GMRES+LU-SGS  method  per 
time  step  is  not  fixed,  as  more  iterations  and,  therefore,  more  CPU  time,  are  required  to 
solve  the  linear  system  at  earlier  time  steps.  This  is  shown  in  Fig.  1  f,  where  the  convergence 
history  of  the  linear  system  at  different  time  steps  using  the  matrix-free  GMRES+LU- 
SGS  method  is  displayed.  Typically,  only  a  few  iterations  are  sufficient  to  meet  the  stop 
criterion  at  lattertime-steps.  Finally,  Fig.  lg  illustrates  the  convergence  history  of  matrix-free 
LU-SGS  method  for  different  CFL  number.  Although,  the  LU-SGS  method  is  stable  for  a 


FIG.  2.  (a)  Upper  and  lower  surface  mesh  used  for  M6wing  configuration  (nelem  =  74 1 ,098,  npoin  =  1 36,05 1 , 
nboun  =  20,762).  (b)  Computed  pressure  contours  on  the  upper  and  lower  surface  at  =  0.84  and  a  =  3.06". 
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FIG.  2.  (c)  Comparison  between  computed  and  experimental  surface  pressure  coefficient  for  wing  section  at 

20%  semispan,  (d)  Comparison  between  computed  and  experimental  surface  pressure  coefficient  for  wing  section 
at  44%  semispan,  (e)  Comparison  between  computed  and  experimental  surface  pressure  coefficient  for  wing 
section  at  65%  semispan,  (f)  Comparison  between  computed  and  experimental  surface  pressure  coefficient  for 
wing  section  at  80%  semispan,  (g)  Comparison  between  computed  and  experimental  surface  pressure  coefficient 
for  wing  section  at  90%  semispan,  (h)  Comparison  between  computed  and  experimental  surface  pressure  coefficient 
for  wing  section  at  95%  semispan. 
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FIG.  2.  (i)  Residual  convergence  history  versus  time  steps  for  M6wing  using  different  schemes:  explicit, 

matrix-free  LU-SGS,  GMRES+ILU,  and  matrix-free  GMRES+LU-SGS.  (j)  Residual  convergence  history  versus 
CPU  time  for  M6wing  using  different  schemes:  explicit,  matrix-free  LU-SGS,  GMRES+ILU,  and  matrix-free 
GMRES+LU-SGS. 


very  large  CFL  number,  the  convergence  history  is  almost  indistinguishable  for  all  the  three 
CFL  numbers  used. 

Finally,  the  same  computation  has  been  performed  on  a  finer  mesh  to  study  the  behav¬ 
ior  of  convergence  history  as  the  grid  is  refined.  The  refined  mesh  contains  99,683  grid 
points,  533,060  elements,  and  17,391  boundary  points.  Figures  Ih  and  li  show  the  conver¬ 
gence  histories  of  the  residual  for  coarse  and  finer  grids  using  the  explicit  method  and  the 
matrix-free  GMRES+LU-SGS  method,  respectively.  As  expected,  the  explicit  method  for 
the  refined  mesh  requires  approximately  twice  the  number  of  time  steps  to  achieve  the  same 
convergence;  the  behavior  of  convergence  history  for  the  implicit  method  is  quite  similar 
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as  on  a  coarse  mesh,  although  the  rate  of  convergence  does  slow  down,  demonstrating  that 
the  advantage  of  the  present  implicit  method  is  not  diminished  for  finer  grids. 

B.  ON  ERA  M6  Wing  Configuration 

The  second,  well-documented  case  is  the  inviscid  transonic  flow  over  a  ONER  A  M6  wing 
configuration.  The  M6  wing  has  a  leading-edge  sweep  angle  of  30°,  an  aspect  of  3.8,  and  a 
taper  ratio  of 0.562.  The  airfoil  section  of  the  wing  is  the  ONERA  “D”  airfoil,  which  is  a  10% 
maximum  thickness-to-chord  ratio  conventional  section.  The  flow  solutions  are  presented 
at  a  Mach  number  of  0.84  and  an  angle  of  attack  of  3.06.  The  mesh  used  in  the  computation 


FIG.  3.  (a)  Surface  mesh  used  for  Wing/Pylon /Finned-Store  configuration  (nelem  =  1,329,694,  npoin  = 
239,547,  nboun  =  27,359).  (b)  Computed  pressure  contours  on  the  upper  surafce  at  Moo  =  0.95  and  a  —  0°. 
(c)  Computed  pressure  contours  on  the  lower  surafce  at  =  0.95  and  a  =  0°. 
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FIG.  3.  (d)  Residual  convergence  history  versus  time  steps  for  Wing/Pylon/Finned-Store  configuration 

using  matrix-free  LU-SGS  and  matrix-free  GMRES+LU-SGS.  (e)  Residua!  convergence  history  versus  CPU 
time  for  Wing/Pylon/Finned-Store  configuration  using  different  schemes:  matrix-free  LU-SGS  and  matrix-free 
GMRES+LU-SGS. 


consists  of  741,095  elements,  136,051  grid  points,  and  20,762  boundary  points.  The  upper 
and  surface  meshes  are  shown  in  Fig.  2a.  The  computed  pressure  contours  on  the  upper  and 
lower  surfaces  are  displayed  in  Fig.  2b.  The  upper  surface  contours  clearly  show  the  sharply 
captured  lambda-type  shock  structure  formed  by  the  two  inboard  shock  waves,  which  merge 
together  near  80%  semispan  to  form  the  single  strong  shock  wave  in  the  outboard  region  of 
the  wing.  The  computed  pressure  coefficient  distributions  are  compared  with  experimental 
data  r  1 8]  at  six  spanwise  stations  in  Figs.  2c-2h.  We  can  observe  that  there  is  only  one 
grid  point  within  the  shock  structure;  this  demonstrates  the  sharp  shock-capturing  ability 
of  AUSM+  scheme.  The  results  obtained  compare  closely  with  experimental  data,  except 
at  the  root  stations,  due  to  lack  of  viscous  effects.  Figures  2i  and  2j  display  a  comparison 
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FIG.  3.  (f)  Effect  of  CFL  number  on  convergence  history  for  for  Wing/Pylon/Finned-Store  configuration 

using  matrix-free  LU-SGS.  (g)  Effect  of  CFL  number  on  convergence  history  for  for  Wing/Pylon/Finned-Store 
configuration  using  matrix-free  GMRES+LU-SGS . 


of  convergence  histories  among  the  explicit  scheme,  matrix-free  LU-SGS  scheme, 
GMRES+ILU  scheme,  and  matrix-free  GMRES+LU-SGS  scheme  versus  time  steps  and 
CPU  time,  respectively.  GMRES+LU-SGS  methods  provide  the  best  convergence  per¬ 
formance.  The  present  GMRES+LU-SGS  method  is  about  three  times  faster  than  the 
GMRES+ILU  method,  about  six  times  faster  than  the  LU-SGS  methods  and  14  times 
faster  than  its  explicit  method. 


C.  Wing/ Pylon/ Finned- Store  Configuration 

The  third  test  case  is  conducted  for  a  wing/pylon/finned-store  configuration  reported  in 
Ref.  [19].  The  configuration  consists  of  a  clipped  delta  wing  with  a  45°  sweep  composed 
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FIG.  4.  (a)  Surface  mesh  used  for  computing  interna!  supersonic  flow  (nclem  =  542,895,  npoin  =  106,285, 
nboun  =  30,094).  (b)  Computed  density  contours  for  supersonic  inlet  flow  (Mf„  =  3).  (c)  Computed  Mach  number 
contours  for  supersonic  inlet  flow  (Mm  =  3).  (d)  Computed  pressure  contours  for  supersonic  inlet  flow  (Mm  =  3). 

of  a  constant  NACA64010  symmetric  airfoil  section.  The  wing  has  a  root  chord  of  16  in., 
a  semispan  of  13  in.,  and  a  taper  ratio  of  0.134.  The  pylon  is  located  at  the  midspan  station 
and  has  a  cross  section  characterized  by  a  flat  plate  closed  at  the  leading  and  trailing  edges 
by  a  symmetrical  ogive  shape.  The  width  of  the  pylon  is  0.294  in.  The  four  fins  on  the  store 
are  defined  by  a  constant  NACA0008  airfoil  section  with  a  leading-edge  sweep  of  45°  and 
a  truncated  tip.  The  mesh  used  in  the  computation  is  shown  in  Fig.  3a.  It  contains  1 ,329,694 
elements,  239,547  grid  points,  and  27,359  boundary  points.  The  flow  solutions  are  presented 
at  a  Mach  number  of  0.95  and  an  angle  of  attack  of  0°.  Figures  3b  and  3c  show  the  pressure 
contours  on  the  upper  and  lower  wing  surface,  respectively.  Because  of  large  size  of  mesh,  no 
attempt  has  been  made  using  matrix  LU-SGS  and  GMRES+LU-SGS  methods  to  compute 
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FIG.  4.  (e)  Residual  convergence  history  versus  time  steps  for  supersonic  inlet  configuration  using  different 

schemes:  matrix-free  LU-SGS,  matrix-free  GMRES+LU-SGS,  and  explicit  method,  (f)  Residual  convergence 
history  versus  CPU  time  for  supersonic  inlet  configuration  using  different  schemes:  matrix-free  LU-SGS,  matrix- 
free  GMRES+LU-SGS,  and  explicit  method. 


this  problem.  Figures  3d  and  3e  display  a  comparison  of  convergence  histories  between 
a  matrix-free  LU-SGS  scheme  and  a  matrix-free  GMRES+LU-SGS  scheme  versus  time 
steps  and  CPU  time,  respectively.  Again,  the  GMRES+LU-SGS  method  provides  faster 
convergence  than  the  LU-SGS  method.  Figures  3f  and  3g  illustrate  the  convergence  history 
using  different  CFL  numbers  of  GMRES+LU-SGS  and  LU-SGS  methods,  respectively. 
Again,  we  can  see  that,  although  the  LU-SGS  method  is  stable  for  a  very  large  CFL  number, 
the  convergence  history  is  almost  indistinguishable  for  all  the  three  CFL  numbers  used. 
However,  a  substantial  gain  can  be  achieved  by  using  a  larger  CFL  number  for  GMRES+ 
LU-SGS  method,  although  no  significant  difference  can  be  observed  once  the  CFL  number 
is  large  enough. 
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D.  Supersonic  Duct  Flow 

This  internal  inviscid  supersonic  flow  case,  taken  from  Nakahashi  and  Saito  [20],  repre¬ 
sents  part  of  a  scramjet  intake.  The  inlet  Mach  number  is  3.  The  total  length  of  the  device  is 
/  —  8.0,  and  the  element  size  was  set  uniformly  throughout  the  domain  to  8  =  0.03.  The  mesh 
shown  in  Fig.  4a  consists  of  542,895  elements,  106,285  grid  points,  and  30,094  boundary 
points.  The  computed  density,  Mach  numbers,  and  pressure  contours  are  shown  in  Figs.  4b, 


FIG.  5.  (a)  Surface  mesh  used  for  flat  plate  configuration  (nclem  =  8 1 ,885,  npoin  =  1 5,694,  nboun  =  3,774). 
(b)  Computed  Mach  number  contours  for  flat  plate  at  M^  -  0.4,  a  =  0.0,  and  Re  =  1 0,000.  (c)  Computed  boundary 
layer  velocity  profile  over  a  flat  plate  at  M^  =  0.4,  a  =  0.0,  and  Re  =  10,000. 
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FIG.  5.  (d)  Residual  convergence  history  versus  time  steps  for  fiat  plate  using  different  schemes:  explicit, 

matrix-free  LU-SGS,  and  matrix-free  GMRES+LU-SGS.  (e)  Residual  convergence  history  versus  cpu  time  for 
flat  plate  using  different  schemes:  explicit,  matrix-free  LU-SGS,  and  matrix-free  GMRES+LU-SGS. 

4c,  and  4d,  respectively.  Figures  4e  and  4f  illustrate  the  convergence  history  among  differ¬ 
ent  numerical  schemes:  matrix-free  LU-SGS,  matrix-free  GMRES+LU-SGS,  and  explicit 
methods,  respectively.  It  indicates  that  the  GMRES+LU-SGS  method  is  superior  to  the 
LU-SGS  method.  CPU  time  comparison  shows  that  the  GMRES+LU-SGS  method  is  about 
eight  time  faster  than  the  explicit  method  for  this  particular  problem. 

E.  Laminar  Flow  Past  a  Flat  Plate 

In  this  test  case,  Blasius  boundary  layers  are  computed  for  a  flat  plate  at  a  Mach  number 
of  0.4  and  a  chord  Reynolds  number  of  10,000.  The  computational  domain  is  considered 
from  x  =  -0.5  to  *  =  1 ,  y  =  0  to  y  =  1,  and  z  =  0  to  z  =  0.5,  where  the  plate  starts  at  x  =  0. 
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The  mesh  used  in  the  computation  is  shown  in  Fig.  5a.  It  contains  81,885  elements,  15,694 
points,  and  3774  boundary  points.  The  computed  Mach  number  contours  in  the  flow  field 
are  depicted  in  Fig.  5b,  where  the  development  of  a  boundary  layer  can  be  clearly  observed. 
Figure  5c  shows  the  comparison  of  the  Blasius  velocity  profile  and  the  computed  velocity 
profiles  as  scaled  by  the  Blasius  similarity  law  for  all  boundary  layer  points.  The  Blasius 
velocity  profile  is  almost  identically  matched  by  all  data  points  with  the  exception  of  a 


b 


FIG.  6.  (a)  Surface  mesh  used  for  NACA0012  airfoil  configuration  (nclem  —  697,655,  npoin—  128,448, 

nboun  =  22,925).  (b)  Computed  velocity  vector  distribution  near  leading  edge  of  airfoil  at  M(X,  =  0.5,  w  =  0.0,  and 
Re  =  5,000. 
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FIG.  6.  (c)  Computed  pressure  contours  on  the  surface  at  =  0.5,  a  =  0.0,  and  Re  =  5,000.  (d)  Computed 

Mach  number  contours  on  the  surface  at  MTO  =  0.5,  a  =  0.0,  and  Re  =  5,000. 

few  points  near  leading  edge.  The  slight  discrepancy  for  these  points  is  attributed  to  the 
leading  edge  singularity.  Figures  5d  and  5e  show  a  comparison  of  convergence  histories 
among  different  numerical  schemes:  matrix-free  LU-SGS,  matrix-free  GMRES+LU-SGS, 
and  explicit  methods,  respectively.  It  indicates  that  the  GMRES+LU-SGS  method  is  superior 
to  the  LU-SGS  method.  CPU  time  comparison  shows  that  the  GMRES+LU-SGS  method 
is  about  10  times  faster  than  the  explicit  method  for  this  particular  problem. 

F.  Laminar  Flow  over  a  NACA0012  Airfoil 

This  test  case  involves  a  laminar  flow  past  a  NACA0012  airfoil  at  a  Mach  number  of 
0.5,  an  angle  of  attack  of  0°,  and  a  chord  Reynolds  number  of  5000.  This  computation  was 
performed  to  see  the  effectiveness  of  the  present  matrix-free  GMRES+LU-SGS  method  for 
the  solution  of  the  Navier-Stokes  equations.  The  mesh  used  in  the  computation  is  shown 
in  Fig.  6a.  It  contains  697,655  elements,  128,488  grid  points,  and  22,925  boundary  points. 
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Time  steps 


FIG.  6.  (e)  Residual  convergence  history  versus  time  steps  for  NACA00I2  airfoil  using  different  schemes: 

explicit,  matrix-free  LU-SGS,  and  matrix-free  GMRES+LU-SGS.  (f )  Residual  convergence  history  versus epu  time 
for  NACAOO 1 2  airfoil  using  different  schemes:  explicit,  matrix-free  LU-SGS,  and  matrix-free  GMRES+LU-SGS. 

The  computed  velocity  vector  distribution  in  the  vicinity  of  the  trailing  edge  of  the  airfoil 
is  shown  in  Fig.  6b,  where  the  separation  and  a  small  recirculation  bubble  can  be  clearly 
observed.  The  computed  separation  point  is  at  81.6%  chord,  which  compares  well  to  the 
one  obtained  by  Mavriplis  [22],  The  computed  pressure  and  Mach-number  contours  are 
shown  in  Figs.  6c  and  6d,  respectively.  Figures  6e  and  6f  illustrate  the  convergence  history 
among  different  numerical  schemes:  matrix-free  LU-SGS,  matrix-free  GMRES+LU-SGS, 
and  explicit  methods,  respectively.  It  indicates  that  the  GMRES+LU-SGS  method  is  far 
superior  to  the  LU-SGS  method.  CPU  time  comparison  shows  that  the  GMRES+LU-SGS 
method  is  more  than  two  orders  of  magnitude  faster  than  the  explicit  method  for  this 
particular  problem.  The  effectiveness  of  the  present  matrix-free  GMRES+LU-SGS  method 
for  the  solution  of  the  Navier-Stokes  equations  is  clearly  demonstrated  in  this  example. 
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6.  CONCLUSIONS 

A  matrix-free  implicit  method  has  been  developed  to  solve  the  three-dimensional  Navier- 
Stokes  and  Euler  equations  on  unstructured  meshes.  The  developed  method  has  been  used 
to  compute  the  compressible  flows  around  3D  complex  aerodynamic  configurations  for  a 
wide  range  of  flow  conditions  from  subsonic  to  supersonic.  The  numerical  results  obtained 
indicate  that  the  use  of  the  GMRES+LU-SGS  method  leads  to  a  significant  increase  in 
performance  over  the  best  current  implicit  methods,  the  GMRES+ILU  and  the  LU-SGS 
methods,  while  maintaining  memory  requirements  that  are  competitive  with  its  explicit 
counterpart.  In  comparison  to  the  explicit  method,  we  demonstrate  an  overall  speedup 
factor  from  eight  to  more  than  one  order  of  magnitude  for  all  test  cases.  The  GMRES+ 
LU-SGS  method  has  also  been  extended  and  applied  successfully  to  solve  the  unsteady 
Euler  and  Navier-Stokes  equations  and  will  be  reported  in  a  later  paper.  The  current  work 
is  to  extend  the  present  GMRES+LU-SGS  method  for  turbulent  flow  problems. 
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ABSTRACT 

A  parallel  advancing  front  scheme  has  been  developed.  The  domain  to  be  gridded  is  first  subdivided  spatially 
using  a  relatively  coarse  octree.  Boxes  are  then  identified  and  gridded  in  parallel.  A  scheme  that  resembles 
closely  the  advancing  front  technique  on  scalar  machines  is  recovered  by  only  considering  the  boxes  of  the 
active  front  that  generate  small  elements.  The  procedure  has  been  implemented  on  the  SGI  Origin  class 
of  machines  using  the  shared  memory  paradigm.  Timings  for  a  variety  of  cases  show  speedups  similar  to 
those  obtained  for  flow  codes.  The  procedure  has  been  used  to  generate  grids  in  excess  of  a  hundred  million 
elements. 

Keywords.  Unstructured  Grid  Generation,  Parallel  Computing,  CFD. 


L  INTRODUCTION 

The  widespread  availability  of  parallel  machines  with 
large  memory,  solvers  that  can  harness  the  power  of 
these  machines,  and  the  desire  to  model  in  ever  in¬ 
creasing  detail  geometrical  and  physical  features  has 
lead  to  a  steady  increase  in  the  number  of  points  used 
in  field  solvers.  Grids  in  excess  of  107  elements  have 
become  common  for  production  runs  in  Computa¬ 
tional  Fluid  Dynamics  (CFD)  [Bau93,  Bau95,  Jou98, 
Yos98,  Mav99]  and  Computational  Electromagnetics 
(CEM)  [Dar97,Mor97].  The  expectation  is  that  in  the 
near  future  grids  in  excess  of  108  —  109  elements  will 
be  required.  While  many  solvers  have  been  ported 
to  parallel  machines,  grid  generators  have  lagged  be¬ 
hind.  For  applications  where  remeshing  is  an  integral 
part  of  simulations,  e.g.  problems  with  moving  bod¬ 
ies  [Loh90,  Mes93,  Mes95,  Bau96,  Kam96,  Loh98a, 
Has98]  or  changing  topologies  [Bau98,  Bau99],  the 
time  required  for  mesh  regeneration  can  easily  con¬ 
sume  more  than  50%  of  the  total  time  required  to 
solve  the  problem.  Faced  with  this  situation,  a  num¬ 
ber  of  efforts  have  been  reported  on  parallel  grid  gen¬ 
eration  [Loh92,  dCo94,  Sho95,  dCo95,  Oku96,  Che97, 
Oku97,  Sai99] . 

The  two  most  common  ways  of  generating  unstruc¬ 
tured  grids  are  the  Advancing  Front  Technique  (AFT) 
[Per87,  Per88,  Loh88a,b,  Per90,  Per92,  Jin93,  Fry94, 
Loh96]  and  the  Generalized  Delaunay  Triangulation 
(GDT)  [Bak89,  Geo91,  Wea92,  Wea94,  Mar95].  The 
AFT  introduces  one  element  at  a  time,  while  the 
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GDT  introduces  a  new  point  at  a  time.  Thus,  both 
of  these  techniques  are,  in  principle,  scalar  by  na¬ 
ture,  with  a  large  variation  in  the  number  of  opera¬ 
tions  required  to  introduce  a  new  element  or  point. 
While  coding  and  data  structures  may  influence  the 
scalar  speed  of  the  ‘core’  AFT  or  GDT,  one  often 
finds  that  for  large-scale  applications,  the  evaluation 
of  the  desired  element  size  and  shape  in  space,  given 
by  background  grids,  sources  or  other  means  [Loh96] 
consumes  the  largest  fraction  of  the  total  grid  genera¬ 
tion  time.  Unstructured  grid  generators  based  on  the 
AFT  may  be  parallelized  by  invoking  distance  argu¬ 
ments,  i.e.  the  introduction  of  a  new  element  only  af¬ 
fects  (and  is  affected  by)  the  immediate  vicinity.  This 
allows  for  the  introduction  of  elements  in  parallel,  pro¬ 
vided  that  sufficient  distance  lies  between  them. 

Several  years  ago,  the  author  and  his  colleagues  in¬ 
troduced  a  parallel  AFT  based  on  the  subdivision 
of  the  background  grid  [Loh92,  Sho95].  While  used 
for  some  demonstration  runs,  this  scheme  was  not 
general  enough  for  a  production  environment.  The 
background  grid  had  to  be  adapted  in  order  to  be 
sufficiently  fine  for  a  balanced  workload.  As  only 
background  grid  elements  covering  the  domain  to  be 
gridded  were  allowed,  complex  in/out  tests  had  to  be 
carried  out  to  remove  refined  elements  lying  outside 
the  domain  to  be  gridded.  Furthermore,  element  size 
specified  at  CAD  entities  could  not  be  ‘propagated’ 
into  the  domain,  as  is  the  case  in  the  scalar  AFT, 
disabling  an  option  favoured  by  many  users  and  ren¬ 
dering  many  grid  generation  data  sets  unusable.  The 
otherwise  positive  experience  gained  with  this  parallel 
AFT  prompted  the  search  for  a  more  general  paral¬ 
lel  AFT.  The  key  requirement  was  a  parallel  AFT 
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that  changes  the  current,  evolved  and  mature  scalar 
AFT  as  little  as  possible,  while  achieving  significant 
speedups  on  common  parallel  machines.  This  implies 
that  the  parallelism  should  be  applied  at  the  level  of 
the  current  front,  and  not  globally. 


2.  PARALLEL  GRIDDING  SCHEME 


The  advancing  front  technique  attempts  to  introduce 
an  element  at  a  time  into  an  as  yet  ungridded  domain 
by  eliminating  the  face  generating  the  smallest  new 
element  from  the  front  [Loh88a,b].  Given  that  the 
introduction  of  an  element  only  affects  its  immedi¬ 
ate  neighbourhood,  one  could,  in  principle,  introduce 
many  elements  at  the  same  time,  provided  they  are 
sufficiently  far  apart.  A  convenient  way  of  delimiting 
the  possible  zones  where  elements  may  be  introduced 
by  each  processor  is  via  boxes.  These  boxes  may  be 
obtained  in  a  variety  of  ways,  i.e.  via  bins,  binary 
recursive  trees,  or  octrees.  We  have  found  the  octree 
to  be  the  best  of  these  possibilities,  particularly  for 
grids  with  a  large  variation  of  element  size.  In  order 
to  recover  a  parallel  gridding  procedure  that  resem¬ 
bles  closely  the  advancing  front  technique  on  scalar 
machines,  only  the  boxes  covering  the  active  front  in 
regions  where  the  smallest  new  elements  are  being  in¬ 
troduced  are  considered.  After  these  boxes  have  been 
filled  with  elements,  the  process  starts  anew:  a  new 
octree  is  built,  new  boxes  are  created  and  meshed  in 
parallel.  The  procedure  is  summarized  schematically 
for  a  2-D  case  in  Figure  1. 


Figure  1  Parallel  Grid  Generation 

At  the  end  of  each  parallel  gridding  pass,  each  one  of 
the  boxes  gridded  can  have  an  internal  boundary  of 
faces.  For  a  large  number  of  boxes,  this  could  result 
in  a  very  large  number  of  faces  for  the  active  front. 
This  problem  can  be  avoided  by  shifting  the  boxes 
slightly,  and  then  regridding  them  again  in  parallel, 
as  shown  in  Figure  2.  This  simple  technique  has  the 
effect  of  eliminating  almost  all  of  the  faces  between 
boxes  with  a  minor  modification  of  the  basic  parallel 
gridding  algorithm. 


Figure  2  Shift  and  Regrid  Technique 

If  we  define  as  the  minimum  element  size  in  the 
active  front,  and  as  smin  the  minimum  box  size  in 
which  elements  are  to  be  generated,  the  parallel  AFT 
proceeds  els  follows: 

WHILE:  There  are  active  faces  left: 

-  Form  an  octree  with  minimum  octant  size 
for  the  active  points; 

-  Retain  the  octants  that  have  faces  that  will  gen¬ 
erate  elements  of  size  dmin  to  c/  •  ; 

-  If  too  many  octants  are  left:  agglomerate  them 
into  boxes; 

-  DO  ISHFT=0,2: 

-  IF:  ISHFT.NE.O: 

Shift  the  boxes  by  a  preset  amount; 

-  ENDIF 

-  Generate,  in  parallel,  elements  in  these 
boxes,  allowing  only  elements  up  to  a  size 
of  c/  • 

-  ENDDO 

Increase  —  1.5  +  Smiti  =  1.5*  fiminj 

ENDWHILE 

The  increase  factor  allowed  is  typically  in  the  range 
ci  =  1.5  —  2.0.  The  shift  vectors  are  given  by 

s  —  Ss (1,1,1),  Ss  =  mm(0.5  *  Smm  ,  2.0  *  dmin )  . 

We  remark  that  the  octree  used  to  compute  the  boxes 
for  parallel  grid  generation  is  very  coarse  compared 
to  the  element  size  specified  by  the  user.  The  edge- 
length  of  the  finest  octree  box  is  of  the  order  of  20 
to  50  times  the  specified  element  size.  This  implies 
that  its  construction  is  very  fast,  and  can  be  accom¬ 
plished  on  a  single  processor  without  discernable  CPU 
penalty. 

3.  WORK  ESTIMATION  AND  BALANCE 

The  procedure  outlined  above  will  work  optimally  if 
each  box  requires  approximately  the  same  CPU  time 
to  complete  its  grid.  This  implies  that  a  good  work 
estimate  should  be  provided.  Given  that  the  boxes 
are  not  body  conforming,  even  for  uniform  grids  the 
volume  to  be  gridded  can  vary  drastically  from  box  to 
box.  A  balanced  workload  can  be  obtained  by  starting 
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with  many  boxes,  estimating  the  work  to  be  done  for 
each  of  them,  and  then  gridding  in  parallel  groups 
of  boxes  with  similar  workload. 


b)  Balance  the  work: 


WHILE:  Work  unbalanced 

-  Estimate,  in  parallel,  the  work  for  each  box; 

-  Obtain  average  work  per  processor; 

-  IF:  A  box  has  a  work  estimate  above  average: 
Subdivide  it  further  (1:8)  and  balance  again; 

-  END IF 

-  Attempt  to  group  boxes  in  such  a  way  that  the 
work  in  each  group  is  close  to  average; 

-  IF:  Good  work  balance  impossible: 

Subdivide  boxes  with  highest  work  estimate  (1:8) 
and  balance  again; 

-  END IF 
ENDWHILE 


Figure  3  Estimation  of  Volume  to  be  Gridded 

The  volume  to  be  gridded  is  estimated  by  the  march¬ 
ing  cubes  procedure  shown  schematically  in  Figure  3. 
Given  the  dimensions  of  the  box,  and  the  list  of  ac¬ 
tive  faces,  the  box  is  first  subdivided  into  voxels  (i.e. 
small  cubes).  In  a  first  pass  over  the  faces,  the  voxels 
cut  by  faces  are  marked  as  ‘inside  the  domain’,  and 
an  average  normal  is  computed  for  each  cut  voxel. 
This  normal  information  is  used  in  a  subsequent  pass 
over  the  voxels  in  order  to  mark  the  neighbours  of 
cut  voxels  as  either  inside  or  outside  the  domain  to 
be  gridded.  The  remaining  voxels  are  then  marked 
as  inside  or  outside  in  several  sweeps  over  the  vox¬ 
els.  These  sweeps  are  carried  out  until  no  further 
voxels  can  be  marked.  Finally,  a  work  estimate  is  ob¬ 
tained  by  summing  the  expected  number  of  elements 
in  each  of  the  voxels  marked  as  inside  the  domain  to 
be  gridded.  This  work  estimation  procedure  is  done 
in  parallel. 

Given  the  estimated  work  in  each  of  the  boxes,  the 
load  is  balanced  in  such  a  way  that  each  processor 
receives  a  similar  amount  of  work.  The  assumption  is 
made  that  the  number  of  boxes  is  always  larger  than 
the  number  of  processors.  Should  this  not  be  the 
case,  the  boxes  are  subdivided  further  (8  new  boxes 
for  each  box).  If  any  given  box  has  a  work  estimate 
that  lies  above  the  average  work  per  processor,  this 
box  is  also  subdivided  further.  This  ‘greedy’  work 
balancing  algorithm  may  be  summarized  as  follows: 

a)  Obtain  a  minimum  nr.  of  active  boxes: 

WHILE:  The  number  of  boxes  is  smaller  than  the  num¬ 
ber  of  processors: 

-  Subdivide  boxes  (1:8); 

ENDWHILE 


In  many  instances,  boxes  within  a  group  will  share 
a  common  face.  In  order  to  avoid  the  (unnecessary) 
buildup  of  many  faces  at  the  borders  of  boxes,  an 
attempt  is  made  to  agglomerate  these  neighbouring 
boxes  within  groups.  This  is  done  recursively  by 
checking,  for  any  pair  of  boxes,  if  two  of  the  dimen¬ 
sions  are  the  same  and  the  boxes  are  coincident  in 
the  remaining  dimension.  If  so,  the  boxes  are  merged. 
Boxes  are  merged  recursively,  until  no  pair  of  boxes 
passes  the  test  outlined  above. 

4.  MESH  IMPROVEMENT 

After  the  generation  of  the  mesh  using  the  parallel 
advancing  front  technique  has  been  completed,  the 
mesh  quality  is  improved  by  a  combination  of  several 
algorithms,  such  as: 

-  Diagonal  swapping, 

-  Removal  of  Bad  Elements,  and 

-  Laplacian  smoothing. 

4.1  Diagonal  Swapping 

Diagonal  swapping  attempts  to  improve  the  quality 
of  the  mesh  by  reconnecting  locally  the  points  in  a 
different  way.  Examples  of  possible  3-D  swaps  are 
shown  in  Figures  4,5. 


Figure  4  Diagonal  Swap  Case  2:3 
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A  A 


Figure  5  Diagonal  Swap  Case  6:8 

The  optimality  criterion  used  is  the  one  proposed  by 
George  (1999). 


r\  _  hmaxS 

V 

where  hmaxyS  and  V  denote  the  maximum  edge 
length,  total  surface  area  and  volume  of  a  tetrahe¬ 
dron.  The  number  of  cases  to  be  tested  can  grow 
factorially  with  the  number  of  elements  surrounding 
an  edge.  Figure  6  shows  the  possibilities  to  be  tested 
for  4,5  and  6  elements  surrounding  an  edge. 

5* 


Figure  6  Swapping  Cases 

Given  that  these  tests  are  computationally  intensive, 
considerable  care  is  required  when  coding  a  fast  di¬ 
agonal  swapper.  Techniques  that  were  used  in  the 
present  implementation  include: 

-  Treatment  of  bad  ( Q  >  Qtoi ),  untested  elements 
only; 

-  Processing  of  elements  in  an  ordered  way,  start¬ 
ing  with  the  worst  (highest  chance  of  reconnec¬ 
tion); 

-  Rejection  of  bad  combinations  at  the  earliest  pos¬ 
sible  indication  of  worsening  quality; 

-  Marking  of  tested  and  unswapped  elements  in 
each  pass. 

As  stated  before,  the  computationally  intensive  part 
of  any  diagonal  swapping  is  sifting  through  all  possi¬ 
bilities  in  order  to  find  a  better  local  point  connectiv¬ 
ity.  One  observes  that  a  very  large  number  of  tests  are 


required  to  obtain  an  improvement.  This  implies  that 
parallelizing  only  the  checking  portion  of  a  diagonal 
swapper,  which  can  easily  be  done  as  no  swapping 
actually  occurs,  will  yield  very  good  speedups.  The 
parallel  swapper  can  then  be  summarized  as  follows: 

-  WHILE:  There  are  bad,  untested  elements  in  the 
heap: 

-  Check  the  worst  elements  in  parallel; 

-  Reconnect  if  possible,  and  the  neighbouring 
elements  have  not  been  reconnected  (scalar, 
integer); 

-  Remember  swapped  elements; 

-  ENDWHILE 

4.2  Removal  of  Bad  Elements 

A  straightforward  way  to  improve  a  mesh  contain¬ 
ing  bad  elements  is  to  get  rid  of  them.  For  tetrahe¬ 
dral  grids  this  is  particularly  simple,  as  the  removal  of 
an  internal  edge  does  not  lead  to  new  element  types 
for  the  surrounding  elements.  Once  the  bad  elements 
have  been  identified  (in  parallel),  they  are  compiled 
into  a  list  and  interrogated  in  turn.  An  element  is 
removed  by  collapsing  the  points  of  one  of  the  edges, 
as  shown  in  Figure  7. 

# 

Figure  7  Removal  of  Element 

This  operation  also  removes  all  the  elements  that 
share  this  edge.  It  is  advisable  to  make  a  check  which 
of  the  points  of  the  edge  should  be  kept:  point  1, 
point  2,  or  a  point  somewhere  on  the  edge  (e.g.  the 
mid-point).  This  implies  checking  all  elements  that 
contain  either  point  1  or  point  2.  This  procedure  of 
removing  bad  elements  is  simple  to  implement  and 
relatively  fast.  On  the  other  hand,  it  will  not  tend  to 
improve  the  overall  quality  of  the  mesh.  It  is  therefore 
used  mainly  in  a  pre-smoothing  or  pre-optimization 
stage,  where  its  main  function  is  to  eradicate  elements 
of  very  bad  quality  from  the  mesh. 

4.3  Laplacian  Smoothing 

A  number  of  smoothing  techniques  are  lumped  un¬ 
der  this  name.  The  edges  of  the  triangulation  are  as¬ 
sumed  to  represent  springs.  These  springs  are  relaxed 
in  time  using  an  explicit  time  stepping  scheme,  until 
an  equilibrium  of  spring-forces  has  been  established. 
Because  ‘globally’  the  variations  of  element  size  and 
shape  are  smooth,  most  of  the  non-equilibrium  forces 
are  local  in  nature.  This  implies  that  a  significant 
improvement  in  mesh  quality  can  be  achieved  rather 
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quickly.  The  force  exerted  by  each  spring  is  propor¬ 
tional  to  its  length  and  along  its  direction.  Therefore, 
the  sum  of  the  forces  exerted  by  all  springs  surround¬ 
ing  a  point  can  be  written  as: 

ns  i 

=cX](xi -Xi)  , 
j- 1 

where  c  denotes  the  spring  constant,  x,  the  coordi¬ 
nates  of  the  point,  and  the  sum  extends  over  all  the 
points  surrounding  the  point.  The  time-advancement 
for  the  coordinates  is  accomplished  as  follows: 

A  a  1  , 

Ax,  —  A t — f i  . 

nsi 

At  the  surface  of  the  computational  domain,  no  move¬ 
ment  of  points  is  allowed,  i.e.  Ax  =  0.  Usually, 
the  timestep  (or  relaxation  parameter)  is  chosen  as 
At  =  0.8,  and  5-6  timesteps  yield  an  acceptable 
mesh.  The  parallelization  of  the  Laplacian  smooth¬ 
ing  is  similar  to  that  of  any  other  field  solver.  The 
loops  over  the  elements/edges  are  colored  so  that 
cash  misses  are  minimized,  pipelining  does  not  lead 
to  memory  contingencies  and  cache-line  overwrite  is 
avoided  [Loh98b].  The  application  of  the  Laplacian 
smoothing  technique  will  yield  an  overall  improve¬ 
ment  of  grid  quality,  but  can  result  in  inverted  or 
negative  elements.  These  negative  elements  are  elim¬ 
inated.  It  has  been  found  advisable  to  remove  not 
only  the  negative  elements,  but  also  all  elements  that 
share  points  with  them.  This  element  removal  gives 
rise  to  voids  or  holes  in  the  mesh,  which  are  regridded 
using  the  advancing  front  technique. 

5.  IMPLEMENTATION 

The  procedure  described  above  was  implemented  us¬ 
ing  the  shared  memory  (i.e.  c$doacross)  paradigm 
on  the  SGI  Origin  2000.  Although  this  is  a 
distributed-memory  machine,  it  can  be  programmed 
as  a  shared-memory  machine.  This  choice  was 
adopted  for  the  following  reasons: 

-  Coding  for  a  shared-memory  environment  is 
much  simpler  than  for  a  distributed-memory  en¬ 
vironment.  The  operating  system  takes  care  of 
most  of  the  inherent  message  passing,  communi¬ 
cation  conflicts,  etc.,  relieving  the  user  from  this 
task; 

-  The  production  codes  used  in  conjunction  with 
the  grid  generator  scale  well  using  the  shared- 
memory  paradigm  [Loh98b].  Even  on  32  proces¬ 
sors,  more  than  50%  of  the  theoretical  speedup  is 
achieved.  Figure  8  shows  the  speedups  obtained 
for  a  steady  compressible  flow  problem  using  an 
edge-based,  upwind  solver  on  different  computer 
platforms.  Note  that  the  speeds  obtained  using 


the  shared  and  distributed  memory  paradigms 
are  comparable. 

-  For  the  last  years,  all  of  the  large-scale  produc¬ 
tion  runs  carried  out  by  the  author  and  his  col¬ 
leagues  [Loh98a,  Loh98c,  Bau98,  Bau99]  were 
performed  on  these  machines,  using  the  shared- 
memory  paradigm; 

-  The  SGI  Origin  2000  has  become  the  dominant 
platform  within  the  High-Performance  Comput¬ 
ing  sites  in  the  US;  at  present,  there  are  more  SGI 
Origin  2000  processors  that  those  of  all  other  ven¬ 
dors  combined;  the  expectation  is  that  this  trend 
will  not  change  drastically  in  the  foreseeable  fu¬ 
ture. 

While  some  of  the  reasons  stated  above  have  to  do 
with  particular  circumstances,  the  basic  ideas  of  the 
proposed  algorithms  are  general,  and  may  be  coded 
within  a  distributed  memory  framework  with  explicit 
message  passing. 


Figure  8  Performance  of  FEFLO  on  Different  Platforms 


6.  EXAMPLES 

The  proposed  parallel  advancing  front  scheme  has 
been  used  extensively  over  the  last  year  in  a  pro¬ 
duction  environment.  We  include  a  sampling  of  ge¬ 
ometries  gridded,  highlighting  the  characteristics  of 
the  proposed  scheme.  The  timings  for  all  examples 
include  surface  gridding,  mesh  generation  and  mesh 
improvement,  i.e.  they  can  considered  representative 
of  the  timings  obtained  in  a  production  environment. 
All  timings  were  obtained  on  SGI  Origin  2000  servers. 

6*1  Cube:  This  academic  example  is  included  here  to 
see  how  the  procedure  works  in  the  ‘best  case  sce¬ 
nario’.  The  unit  cube  is  to  be  gridded  with  a  uni¬ 
form  mesh  of  approximaterly  1  million  tetrahedra. 
Although  this  is  not  a  large  grid,  the  timings  shown 
in  Figure  9  are  illustrative. 
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Figure  9  Cube:  Speedups  Obtained 

6.2  Aneurism:  This  example  is  taken  from  a  hemody¬ 
namic  analysis  recently  conducted  for  this  particular 
geometry.  The  outline  of  the  domain,  grid,  as  well  as 
some  sample  results  are  shown  in  Figure  10.1. 


Figure  10.1  Aneurism 


Figure  10.2  Aneurism:  Speedups  Obtained 

The  final  uniform  mesh  had  approximately  2.7  mil¬ 
lion  tetrahedra.  The  speedup  obtained  is  shown  in 
Figure  10.2.  As  one  can  see,  the  parallel  mesher  was 
more  efficient  for  this  finer  grid  than  for  the  unit  cube, 
even  though  the  volume  to  be  gridded  in  each  box  can 


vary  widely  due  to  geometric  features. 

6.3  Garage:  This  example  was  taken  from  a  blast  sim¬ 
ulation  recently  carried  out  for  an  office  complex.  The 
outline  of  the  domain  is  shown  in  Figure  11.1. 


The  final  uniform  mesh  had  approximately  9.2  mil¬ 
lion  tetrahedra.  The  speedup  obtained  is  shown  in 
Figure  11.2.  As  before,  the  parallel  mesher  was  more 
efficient  for  this  finer  grid  than  for  the  unit  cube,  even 
though  the  volume  to  be  gridded  in  each  box  can  vary 
widely  due  to  geometric  features. 


Nr.  of  Processors 


Figure  11.2  Garage:  Speedups  Obtained 

6.4  Tysons  Corner:  This  example  was  taken  from 
a  dispersion  simulation  recently  carried  out  for  this 
well-known  shopping  center.  The  outline  of  the  do¬ 
main  is  shown  in  Figures  12.1,12.2. 


Figure  12.1  Tysons  Corner:  Wireframe  (W) 
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Figure  12.2  Tysons  Corner:  Wireframe  (N) 

The  final  mesh  had  approximately  16  million  tetrahe- 
dra.  The  smallest  and  largest  specified  element  side 
lengths  were  230  cm  and  1400  cm  respectively.  The 
speedup  obtained  is  shown  in  Figure  12.3. 
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Nr.  of  Processors 


Figure  12.3  Tysons  Corner:  Speedups  Obtained 

6.5  Space  Shuttle:  This  example  is  included  here  be¬ 
cause  it  is  typical  of  many  aerodynamic  data  sets. 
The  outline  of  the  domain  is  shown  in  Figure  13.1. 


Figure  13.1  Space  Shuttle:  Outline  of  Domain 


The  surface  triangulation  of  the  final  mesh,  which  had 
approximately  4  million  tetrahedra,  is  shown  in  Fig¬ 
ure  13.2.  The  smallest  and  largest  specified  element 
side  lengths  were  5.08  cm  and  467.00  cm  respectively, 


i.e.  an  edge-length  ratio  of  approximately  1  :  102  and 
a  volume  ratio  of  1  :  106.  The  spatial  variation  of  ele¬ 
ment  size  was  specified  via  approximately  200  sources 
[Loh96]. 


Figure  13.2  Space  Shuttle:  Surface  Mesh 


Nr.  of  Processors 


Figure  13.3  Space  Shuttle:  Speedups  Obtained 

The  speedup  obtained  is  shown  in  Figure  13.3.  Fi¬ 
nally,  the  results  of  an  Euler  run  for  an  incoming 
Mach-nr.  of  M  =  2.0  and  angle  of  attack  of  a  =  1.1® 
are  shown  in  Figure  13.4. 


Figure  13.4  Space  Shuttle:  Surface  Pressures 
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6.6  Pilot  Electing  From  F18:  Problems  with  mov¬ 
ing  bodies  often  require  many  remeshings  during  the 
course  of  a  simulation.  The  case  shown  here  was  taken 
from  a  pilot  ejection  simulation  recently  conducted. 
The  outline  of  the  domain  is  shown  in  Figure  14.1. 
The  surface  triangulation  of  the  final  mesh,  which  had 
approximately  14  million  tetrahedra,  is  shown  in  Fig¬ 
ure  14.2.  The  smallest  and  largest  specified  element 
side  lengths  were  0.65  cm  and  250.00  cm  respectively, 
i.e.  an  edge-length  ratio  of  approximately  1:4*  102 
and  a  volume  ratio  of  1  :  5.6  •  107.  The  spatial  vari¬ 
ation  of  element  size  was  specified  via  approximately 
110  sources  [Loh96]. 


Figure  14.1  F18  Pilot  Ejection:  Outline  of  Domain 


Figure  14.2  F18  Pilot  Ejection:  Surface  Mesh  (Closeup) 


Figure  14.3  F18  Pilot  Ejection:  Speedups  Obtained 


The  speedup  obtained  for  two  different  grid  sizes  is 
displayed  in  Figure  14.3. 

As  could  be  seen  from  the  previous  examples,  the  grid 
generator  certainly  scales  well  with  the  number  of  pro¬ 
cessors.  As  with  many  other  areas  where  parallel  pro¬ 
cessing  is  being  attempted,  scalability  improves  with 
the  amount  of  work  required.  The  larger  the  grids, 
the  better  the  scalability.  One  can  also  observe  that 
for  each  case  the  ‘scalability  slope*  is  somewhat  dif¬ 
ferent,  ranging  from  almost  perfect  for  the  garage  to 
a  1:3  asymptote  for  the  Shuttle.  Closer  inspection  re¬ 
vealed  that  for  the  Shuttle,  although  the  final  mesh 
contains  4  million  elements,  the  number  of  elements 
created  in  each  parallel  pass  over  increasing  element 
sizes  was  rather  modest,  never  exceeding  0.5  million 
elements.  Thus,  the  inherent  parallelism  of  this  prob¬ 
lem  is  rather  low. 

7.  CONCLUSIONS  AND  OUTLOOK 

A  parallel  advancing  front  scheme  has  been  devel¬ 
oped.  The  domain  to  be  gridded  is  first  subdivided 
spatially  using  a  relatively  coarse  octree.  Boxes  are 
then  identified  and  gridded  in  parallel.  A  scheme 
that  resembles  closely  the  advancing  front  technique 
on  scalar  machines  is  recovered  by  only  considering 
the  boxes  of  the  active  front  that  generate  small  ele¬ 
ments.  The  procedure  has  been  implemented  on  the 
SGI  Origin  class  of  machines  using  the  shared  mem¬ 
ory  paradigm.  Timings  for  a  variety  of  cases  show 
speedups  similar  to  those  obtained  for  flow  codes.  The 
procedure  has  been  used  to  generate  grids  for  a  large 
variety  of  cases,  and  is  nearing  production  maturity. 

Current  work  is  focusing  on  improved  work  prediction 
algorithms,  as  it  is  found  that  in  going  to  a  larger 
number  of  processors,  any  small  imbalance  incurs  a 
heavy  CPU  penalty. 
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ABSTRACT 

A  procedure  for  the  generation  of  highly  stretched  grids  suitable  for  Reynolds-Averaged  Navier-Stokes 
(RANS)  calculations  is  presented.  In  a  first  stage,  an  isotropic  (Euler)  mesh  is  generated.  In  a  second  stage, 
this  grid  is  successively  enriched  with  points  in  order  to  achieve  highly  stretched  elements.  The  element 
reconnection  is  carried  out  using  a  constrained  Delaunay  approach.  Points  are  introduced  from  the  regions 
of  lowest  stretching  towards  the  regions  of  highest  stretching.  The  procedure  has  the  advantages  of  not  re¬ 
quiring  any  type  of  surface  recovery,  not  requiring  extra  passes  or  work  to  mesh  concave  ridges/corners,  and 
guarantees  a  final  mesh,  an  essential  requirement  for  industrial  environments.  Given  that  point  placement 
and  element  quality  are  highly  dependent  for  the  Delaunay  procedure,  special  procedures  were  developed  in 
order  to  obtain  optimal  point  placement. 


I.  INTRODUCTION 

Many  problems  of  Computational  Mechanics  are 
characterized  by  a  very  strong  anisotropy  in  the  spa¬ 
tial  variation  of  the  fields  of  interest.  A  typical  exam¬ 
ple  in  fluid  mechanics  is  a  boundary  layer.  For  lam¬ 
inar  flow,  the  variations  in  the  streamwise  direction 
will  be  3-5  orders  of  magnitude  less  than  normal  to  it. 
The  same  applies  to  turbulent  flows  simulated  using 
the  Reynolds- Averaged  Navier-Stokes  (RANS)  equa¬ 
tions  with  suitable  turbulence  models.  The  reliable 
generation  of  high  quality  grids  for  RANS  simulations 
has  been  attempted  with  varying  degrees  of  success 
by  several  authors  during  the  last  decade  (Nakahashi, 
1987;  Kallinderis,  1992;  Lohner,  1993;  Pirzadeh,  1994; 
Marcum,  1995;  Pirzadeh,  1996;  Peraire  and  Mor¬ 
gan,  1996-1998).  Given  that  the  generation  of  highly 
stretched  grids  is  not  trivial,  and  that  computing 
power  is  increasing  steadily,  the  immediate  question 
that  comes  to  mind  is  when  RANS  simulations  will 
be  replaced  by  Large-Eddy  Simulations  (LES)  or  even 
Direct  Navier-Stokes  (DNS)  simulations.  Let  us  as¬ 
sume  an  optimal  mesh  for  LES  simulations.  This 
could  be  an  adaptive  Cartesian  grid  that  consists  of 
typical  (large)  Euler  cells  in  the  field  and  very  small 
cells  in  the  boundary  layers  in  order  to  capture  all 
relevant  scales.  Clearly,  most  points/cells  will  be  lo¬ 
cated  in  the  boundary  layer  regions.  Denoting  by  Np 
the  number  of  points  and  by  h  the  characteristic  cell 
size,  we  have: 


If  we  assume,  conservatively,  a  laminar  B-747  wing 
with  a  chord  Reynolds-nr.  of  Rem  =  106,  and  fur¬ 
thermore  assume  that  the  boundary  layer  obeys  the 
flat  plate  formula: 


filam  _  5.5 

x  ~~  y/Rex 

the  boundary  layer  thickness  will  be  approximately 
<5  «  5  •  10~3  m,  implying  an  (isotropic)  element  size 
of  at  most  h  «  5  •  10"4  m.  The  resulting  number  of 
gridpoints  is  then: 


^  _  N$  •  Awjng  _  10  •  250  7W?  _ 10*^ 

P  ~  ft2  ~  (5  •  10-4  mf  ~ 

Current  RANS  production  runs  operate  with  Np  = 
107.  From  Moore’s  Law  (the  doubling  of  computing 
power  every  18  months),  we  can  foresee  LES  grids  in 
15  years.  As  far  as  CPU  is  concerned,  the  number  of 
timesteps  Nt  required  to  advect  a  particle  accurately 
across  the  wing  is  proportional  to  the  number  of  cells, 
i.e. 


Nt 


5m 

5  •  10~4m 


=  O(104) 


Assuming  the  number  of  floating  point  operations  per 
point  per  timestep  to  be  Njpp  =  O(103),  this  would 
result  in  an  operation  count  of 
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Nops  =  O(1010)  ■  O(104)  •  O(103)  =  O(1017) 

Given  that  the  limit  of  human  patience  lies  some¬ 
where  around  0(1O3)  sec ,  the  operation  count  ob¬ 
tained  above  implies  a  CPU  performance  requirement 
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of  O(1014)  FLOPS.  Current  production  runs  can  op¬ 
erate  at  Np  =  1011  (100  GFLOPS).  Invoking  once 
more  Moore’s  Law,  we  can  foresee  LES  runs  in  15 
years.  If  we  perform  a  sensitivity  analysis,  we  note 
that  the  only  linear  component  in  these  numbers  was 
the  human  patience  (e.g.  lhr  to  1  day).  As  soon  as  we 
increase  grid  resolution  by  a  factor  of  10,  we  increase 
the  number  of  points  by  103,  the  number  of  timesteps 
by  10,  and  the  total  effort  by  104,  i.e.  we  have  to  wait 
yet  another  20  years  before  we  can  carry  out  such 
a  simulation.  Faced  with  the  pressing  and  immedi¬ 
ate  need  to  compute  flows  where  Reynolds-number 
effects  are  important,  we  see  that  the  optimal  RANS 
gridding  of  geometrically  complex  domains  is  still  an 
important  topic  of  research. 

2.  THE  RANS  GRIDDING  TECHNIQUE 

The  generation  of  isotropic  unstructured  grids  has 
reached  a  fairly  mature  state,  as  evidenced  by  the 
many  publications  that  have  appeared  over  the  last 
decade  on  this  subject  (Baker,  1987;  Lohner  and 
Parikh,  1988;  Peraire  et  al.,  1988;  George  et  ah, 
1990;  Joe,  1991;  George,  1991;  George  and  Herme- 
line,  1992;  Lohner  et  al.,  1992;  Weatherill,  1992; 
Mavriplis,  1993;  Jin  and  Tanner,  1993;  Weatherill 
and  Hassan,  1994;  Marcum  and  Weatherill,  1995; 
Shostko  and  Lohner,  1995,  Lohner,  1996)  and  the 
widespread  use  of  unstructured  grids  in  industry.  The 
two  most  widely  used  techniques  are  the  advancing 
front  technique  (Lohner  and  Parikh,  1988;  Peraire  et 
al.,  1988;  Lohner  et  al.,  1992;  Jin  and  Tanner,  1993; 
Shostko  and  Lohner,  1995,  Lohner,  1996)  and  the  De¬ 
launay  triangulation  (Baker,  1987;  Joe,  1991;  George, 
1991;  George  and  Hermeline,  1992;  Weatherill,  1992; 
Weatherill  and  Hassan,  1994;  Marcum  and  Weather¬ 
ill,  1995).  Hybrid  schemes,  that  combine  an  advanc¬ 
ing  front  point  placement  with  the  Delaunay  recon¬ 
nection  have  also  been  used  successfully  (Merriam, 
1991;  Mavriplis,  1993;  Muller,  1993).  These  isotropic 
mesh  generation  techniques  have  been  used  to  gen¬ 
erate  grids  with  mildly  stretched  elements  within  the 
context  of  adaptive  remeshing  (Peraire,  1987;  Lohner, 
1988;  Tilch,  1991;  Habashi,  1998).  However,  they 
fail  when  attempting  to  generate  highly  stretched 
elements,  a  key  requirement  for  Reynolds- Averaged 
Navier  Stokes  (RANS)  calculations  with  turbulence 
models  that  reach  into  the  sublayer. 

A  number  of  specialized  schemes  have  been  pro¬ 
posed  to  remedy  this  situation  (Nakahashi,  1987; 
Kallinderis,  1992;  Lohner,  1993;  Pirzadeh,  1994; 
Marcum,  1995;  Pirzadeh,  1996;  Peraire  and  Mor¬ 
gan,  1996).  The  domain  to  be  gridded  was  di¬ 
vided  into  isotropic  and  stretched  element  regions. 
In  addition,  a  blending  procedure  to  transition 
smoothly  between  these  zones  was  provided.  Typi¬ 
cally,  the  stretched  mesh  region  was  generated  first 


(Kallinderis,  1992;  Lohner,  1993;  Pirzadeh,  1994; 
Marcum,  1995;  Pirzadeh,  1996).  Although  we  have 
used  such  a  scheme  (Lohner,  1993)  for  a  number  of 
years,  we  have  found  several  situations  in  which  the 
requirement  of  a  semi-structured  element  or  point 
placement  close  to  wetted  surfaces  is  impossible, 
prompting  us  to  search  for  a  more  general  technique. 
This  procedure  may  be  summarized  as  follows: 

-  Generate  an  isotropic  mesh;  this  can  be  done 
with  any  unstructured  grid  generator; 

-  Remove  all  points  in  regions  where  stretched  el¬ 
ements  are  to  be  generated; 

-  Using  a  constrained  Delaunay  technique,  intro¬ 
duce  points  in  order  to  generate  highly  stretched 
elements; 

-  Introduce  the  points  in  ascending  levels  of 
stretching,  i.e.  from  the  domain  interior  to  the 
boundary. 

This  procedure  has  the  following  advantages: 

-  No  surface  recovery  is  required  for  the  Delaunay 
reconnection,  eliminating  the  most  problematic 
part  of  this  technique; 

-  Proper  meshing  of  concave  ridges/corners  is  ob¬ 
tained; 

-  The  meshing  of  concave  ridges/corners  requires 
no  extra  work;  this  important  advantage  is  shown 
in  Figure  1; 

-  Meshing  problems  due  to  surface  curvature  are 
minimized; 

-  In  principle,  no  CAD  representation  of  the  sur¬ 
face  is  required;  and 

-  A  final  mesh  is  guaranteed,  an  essential  require¬ 
ment  for  automation. 

The  disadvantages  are  the  following: 

-  As  with  any  Delaunay  technique,  the  mesh  qual¬ 
ity  is  extremely  sensitive  to  point  placement. 


Figure  1  Introduction  of  Points  at  Corners 


3.  INSERTION  OF  POINTS 

The  insertion  of  points  is  carried  out  using  the  con¬ 
strained  Delaunay  procedure  (George,  1991),  which 
may  be  summarized  as  follows.  Given  a  new  point  i 
at  location  x, : 

-  Find  the  element(s)  x,  falls  into; 
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-  Obtain  all  elements  whose  circumsphere  encom¬ 
passes  Xj; 

-  Remove  from  this  list  of  elements  all  those  that 
would  not  form  a  proper  element  (volume,  an¬ 
gles)  with  x,*;  this  results  in  a  properly  con¬ 
strained  convex  hull; 

-  Reconnect  the  outer  faces  of  the  convex  hull  with 
Xf  to  form  new  elements. 

The  procedure  has  been  sketched  in  Figure  2. 


Figure  2  Introduction  of  Field  Points 


For  boundary  points  some  additional  steps  are  re¬ 
quired.  Given  a  new  point  boundary  point  i  at  lo¬ 
cation  x,*: 

-  Determine  if  the  point  is  on  a  boundary  edge  or 
face; 

-  Reconnect  these  elements  without  regard  to  the 
Delaunay  criterion; 

-  Find  the  element(s)  x,*  falls  into; 

-  Obtain  all  elements  whose  circumsphere  encom¬ 
passes  x*; 

-  Remove  from  this  list  of  elements  all  those  that 
would  not  form  a  proper  element  (volume,  an¬ 
gles)  with  xt;  this  results  in  a  properly  con¬ 
strained  convex  hull; 

-  Reconnect  the  boundary  faces  (see  Figure  3); 

-  Reconnect  the  outer  faces  of  the  convex  hull  with 
x,*  to  form  new  elements. 

The  reconnection  of  boundary  faces  is  carried  out  by 
diagonal  swapping.  For  curved  surfaces,  it  is  neces¬ 
sary  to  apply  angle  constrains  in  order  not  to  lose  sur¬ 
face  resolution/ definition  or  surface  patch  integrity. 

4  Original  Surfac*  Triangulation  b)  Inwwt  Point  A*  Is  c)  Reconnect  Boundary  Ficm 


Figure  3  Introduction  of  Surface  Points 


The  points  are  inserted  according  to  layers  following 
the  normals  emanating  from  ‘wetted’  surface  points, 
starting  from  the  outermost  layer,  and  moving  to¬ 
wards  the  boundaries  or  wake  centerlines.  Points  are 
only  introduced  if  the  spacing  normal  to  the  wall  is 
below  a  fraction  of  the  isotropic  element  size  speci¬ 
fied  by  the  user  at  the  particular  location.  This  is 


important  for  grids  with  a  large  variation  of  element 
size,  and  produces  a  smooth  transition  from  the  Euler 
region  into  the  RANS  region. 

4.  ADDITION  OF  EXTRA  SURFACE  POINTS 

For  complex  geometries  with  narrow  surface  strips 
close  to  concave  edges,  it  is  not  possible  to  obtain 
a  good  surface  mesh  unless  one  introduces  further 
points  in  these  regions.  A  typical  situation  is  shown 
in  Figure  4. 


These  additional  points  are  introduced  by  identify¬ 
ing  the  corners  where  potential  problems  can  appear. 
These  are  typically  concave,  and  are  characterized  by 
normals  that  would  introduce  points  close  to  another 
edge.  Subsequent  reconnection  using  the  constrained 
Delauney  technique  would  yield  elements  with  very 
large  angles.  In  order  to  avoid  this,  additional  points 
are  introduced  along  the  concave  edge  to  mitigate  this 
topological  effect. 

5.  CONSTRUCTION  OF  NORMALS 

The  insertion  of  points  to  construct  highly  stretched 
elements  is  carried  out  along  normals  that  may  start 
either  on  the  boundary  (boundary  layers)  or  in  the 
field  (wakes).  The  number  of  normals  emanating  from 
a  surface  point  can  vary,  depending  on  whether  we 
have  a  convex  or  concave  surface.  Figure  5  shows 
just  a  few  of  a  large  class  of  cases  that  have  to  be 
considered. 
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6.  REMOVAL  OF  POINTS  BEFORE  RANS  GRIDDING 

Since  the  quality  of  grids  generated  using  the  Delau¬ 
nay  technique  is  very  sensitive  to  point  placement,  it 
is  advisable  to  remove  any  (isotropic)  points  that  may 
interfere  with  the  semi-structured  points  in  the  highly 
stretched  regions.  The  regions  where  this  could  hap¬ 
pen  can  be  obtained  before  starting  the  RANS  mesh¬ 
ing.  The  point  removal  algorithm  may  be  summarized 
as  follows: 


-  For  each  point  marked  for  removal: 

-  Obtain  all  edges  touching  this  point,  and  the 
corresponding  neighbouring  points; 

-  Remove  the  edges  that  can  not  be  collapsed 
due  to  topological  considerations  (end-,  line- 
,  surface-  or  volume  points); 

-  Remove  the  edges  that,  if  collapsed,  would 
lead  to  small  or  negative  elements; 

-  Remove  the  edges  that,  if  collapsed,  would 
lead  to  very  small  or  very  large  angles; 

-  Order  the  remaining  edges  according  to  their 
length; 

-  Remove  the  marked  edges,  renumbering  the 
elements; 

-  Compress  and  renumber  the  element  and  point 
lists. 

The  effectiveness  of  this  point  removal  algorithm  may 
be  enhanced  by  combining  it  with  edge  and  face  swap¬ 
ping  and  point  movement.  The  final  point  removal 
algorithm  then  takes  the  form  of  the  following  loop: 


-  DO:  For  a  maximum  number  of  passes: 

-  Remove  marked  points  by  edge-collapse; 

-  IF:  Points  could  be  removed: 

-  Perform  edge  and  face  swapping; 

-  ELSE 

-  Move  points  that  could  not  be  removed; 

-  Perform  edge  and  face  swapping; 

-  END IF 

-  ENDDO 

The  advantage  of  moving  points  can  be  seen  from  the 
small  example  shown  in  Figure  6.  Any  edge  collapse 
for  the  point  in  the  center  of  the  domain  would  lead 
to  elements  with  vanishing  volumes.  Mesh  movement 
in  combination  with  diagonal  swapping  removes  this 
quandary.  This  combination  of  point  movement  and 
edge  swapping  is  very  efficient,  typically  leaving  only 
0.1%  of  the  points  marked  for  removal  in  the  mesh. 


a)  b) 


7.  POINT  INTRODUCTION  FOR  GAPS 

In  narrow  gaps,  the  introduction  of  points  from  two 
close  surfaces  covered  with  RANS  grids  can  lead  to 
very  poor  mesh  quality.  Figure  7  shows  an  example 
where  the  resulting  mesh  is  clearly  inappropriate  for 
RANS  calculations.  In  order  to  avoid  the  introduc¬ 
tion  of  points  in  such  regions,  the  host  element  into 
which  the  new  point  falls  is  checked  for  proximity  to 
another  surface.  If  any  of  the  points  of  this  element 
are  on  the  surface  and  are  too  close  to  the  new  point, 
the  new  point  is  rejected.  We  remark  that  the  De- 
launey  reconnection  procedure  requires  the  determi¬ 
nation  of  the  host  element,  so  that  this  check  incurs 
only  a  modest  amount  of  CPU. 


Figure  7  Gridding  of  Gaps 


8.  EXAMPLES 

The  described  RANS  gridding  procedure  has  been  op¬ 
erational  for  approximately  a  year,  and  was  used  to 
mesh  a  series  of  test  cases.  Four  of  these  are  included 
here. 

a)  Nose- Cone:  The  surface  mesh  of  the  isotropic  mesh 
is  shown  in  Figure  8.1.  The  removal  of  points  within 
the  future  non-isotropic  layers  of  elements  results  in 
the  surface  mesh  shown  in  Figure  8.2.  The  final  sur¬ 
face  mesh  is  depicted  in  Figure  8.3.  One  can  see  the 
considerable  stretching  achieved. 
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b)  Ship:  This  case  shows  a  RANS  grid  for  the  generic 
chemical  tanker  shown  in  Figure  9.1.  Figures  9.2  and 
10.1-10.3  show  surface  grids  and  cross-sections  for  the 
generated  mesh.  The  isotropic  (Euler)  mesh  for  this 
case  had  approximately  1.2  million  elements,  while 
the  final,  anisotropic  (RANS)  mesh  had  close  to  5  mil¬ 
lion  elements. 

c)  Flyer:  The  third  configuration  considered  is  that  of 
a  generic  hypersonic  flyer.  Figures  11.1-11.2  show  the 
surface  grids  obtained,  and  Figures  12.1-12.2  show  a 
plane  cut  in  the  region  of  the  vertical  tail  and  sta¬ 
bilizer.  Observe  the  proper  gridding  in  the  corner 
regions.  The  isotropic  (Euler)  mesh  for  this  case 
had  approximately  2  million  elements,  while  the  final, 
anisotropic  (RANS)  mesh  had  approximately  6  mil¬ 
lion  elements.  On  an  SGI  Origin  2000,  using  1  R10000 
processor,  the  isotropic  grid  generation  took  approxi¬ 
mately  45  minutes,  while  the  anisotropic  enrichment 
took  approximately  40  minutes.  One  can  see  from 
these  figures  that  the  speed  of  the  proposed  RANS 
gridding  technique  is  acceptable. 

d)  Racecar:  The  fourth  configuration  considered  is 
that  of  a  generic  racecar.  Figures  13.1-13.2  show 
the  surface  definition  and  the  overall  surface  grid, 
while  Figures  14-15  focus  on  particular  regions  of  the 
mesh  (driver,  nose  of  the  car,  back  of  the  car,  gaps, 
etc.),  which  had  approximately  8.3  million  elements. 
A  result  for  a  flow  simulation  using  Luo’s  implicit 
LU-SGS-GMRES  solver  (Luo  1998)  is  shown  in  Fig¬ 
ures  16.1-16.2,  demonstrating  the  usefulness  of  the 
procedure. 

9.  CONCLUSIONS  AND  OUTLOOK 

A  procedure  for  the  generation  of  highly  stretched 
grids  suitable  for  Reynolds-Averaged  Navier-Stokes 
(RANS)  calculations  has  been  developed.  In  a  first 
stage,  an  isotropic  (Euler)  mesh  is  generated.  In  a 
second  stage,  this  grid  is  successively  enriched  with 
points  in  order  to  achieve  highly  stretched  elements. 
The  element  reconnection  is  carried  out  using  a  con¬ 
strained  Delaunay  approach.  Points  are  introduced 
from  the  regions  of  lowest  stretching  towards  the  re¬ 
gions  of  highest  stretching.  The  procedure  has  the 
advantages  of  not  requiring  any  type  of  surface  re¬ 
covery,  not  requiring  extra  passes  or  work  to  mesh 
concave  ridges/corners,  and  guarantees  a  final  mesh, 
an  essential  requirement  for  industrial  environments. 
Given  that  point  placement  and  element  quality  are 
highly  dependent  for  the  Delaunay  procedure,  special 
procedures  are  required  in  order  to  obtain  optimal 
point  placement.  Among  these,  the  most  important 
is  the  removal  of  points  that  fall  into  the  RANS  zone 
before  enriching  the  mesh. 

Several  examples  demonstrate  the  usefulness  of  the 
proposed  anisotropic  gridding  technique.  Timings 


from  grids  generated  to  date  show  that  the  procedure 
is  faster  than  traditional  advancing  front  techniques 
for  isotropic  grids,  implying  that  the  generation  of 
anisotropic  grids  does  not  place  an  extra  burden  on 
CPU  or  memory  requirements. 

Near-future  work  will  center  on: 

-  Improvements  in  point  placement  and  element 
control; 

-  Automatic  wake  placement  and  topological  con¬ 
nection  to  CAD  models;  and 

-  Parallelization. 

Looking  further  into  the  future,  we  envision  fully  au¬ 
tomatic  RANS  gridders  integrated  into  a  multidis¬ 
ciplinary,  database-linked  framework  that  is  accessi¬ 
ble  anywhere  on  demand,  simulations  with  unprece¬ 
dented  detail  and  realism  carried  out  in  fast  succes¬ 
sion,  and  first-principles  driven  virtual  reality. 
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Figure  8  Surface  of:  Isotropic  Mesh,  After  Element  Removal  and  Final  Mesh 


Figure  9  Computational  Domain  and  Surface  Mesh  Along  the  Side 


Figure  10  Transversal  Cut,  Stern  Region  and  Bow  Region 
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Figure  11  Generic  Hypersonic  Flyer:  Surface  Grid  and  Detail 


Figure  12  Generic  Hypersonic  Flyer:  Planar  Cut  and  Detail 


Figure  13  Generic  Racecar:  Surface  Definition  and  Surface  Mesh 
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Figure  14  Driver  and  Nose  of  the  Car 


Figure  15  Details  of  Back,  Back-Wheel  and  Car-Floor  Gap 


Figure  16  Pressure  Contours  and  Streamlines 


9 


APPENDIX  4:  FLUID-STRUCTURE-THERMAL  INTERACTION 


19 


AIAA-98-241 9 


Fluid-Structure-Thermal  Interaction  Using 
A  Loose  Coupling  Algorithm 
And  Adaptive  Unstructured  Grids 

Rainald  Lohner1  Chi  Yang1,  Juan  Cebral1, 
Joseph  D.  Baum  ,  Hong  Luo , 

Daniele  Pelessone3  and  Charles  Charman? 

^George  Mason  University,  Fairfax,  VA 
gSAIC,  McLean,  VA 
General  Atomics,  San  Diego,  CA 


29th  AIAA  Fluid  Dynamics  Conference 

June  15-18,  1998  /  Albuquerque,  NM 


For  permission  to  copy  or  republish,  contact  the  American  Institute  of  Aeronautics 
1801  Alexander  Bell  Drive,  Suite  500,  Reston,  VA  20191 


and  Astronautics 


AIAA-98-2419 


FLUID-STRUCTURE-THERMAL  INTERACTION  USING 
A  LOOSE  COUPLING  ALGORITHM 
AND  ADAPTIVE  UNSTRUCTURED  GRIDS 

Rainald  Lohner1,  Chi  Yang1,  Juan  Cebral1, 

Joseph  D.  Baum2,  Hong  Luo2, 

Daniele  Pelessone3  and  Charles  Charman3 

1GMU/CSI,  George  Mason  University,  Fairfax,  VA  22030,  USA 
2 Science  Applications  International  Corporation 
1710  Goodridge  Drive,  MS  2-3-1,  McLean,  VA  22102,  USA 
3 General  Atomics,  San  Diego,  CA  92121,  USA 


ABSTRACT 

We  present  a  loosely  coupled  algorithm  to  combine  Computational  Fluid  Dynamics  (CFD),  Computational 
Structural  Dynamics  (CSD)  and  Computational  Thermo-Dynamics  (CTD)  codes  in  order  to  solve,  in  a  cost- 
effective  manner,  fluid-structure-thermal  interaction  problems.  The  basic  fluid-,  structural-  and  thermo¬ 
dynamics  codes  are  altered  as  little  as  possible.  The  structure  is  used  as  the  ‘master-surface’  to  define  the 
extent  of  the  fluid  region,  and  the  fluid  is  used  as  the  ‘master-surface’  to  define  the  mechanical  and  thermal 
loads.  The  transfer  of  loads,  heat  fluxes,  displacements,  velocities  and  temperatures  is  carried  out  via  fast 
interpolation  and  projection  algorithms.  As  shown,  this  fluid-structure-thermal  algorithm  can  be  interpreted 
as  an  iterative  solution  to  the  fully-coupled,  large  matrix  problem  that  results  from  the  discretization  of  the 
complete  problem.  Results  from  steady  supersonic  flow  and  shock-structure  interaction  problems  indicate 
that  the  proposed  approach  offers  a  convenient  and  cost-effective  way  of  coupling  CFD,  CSD  and  CTD  codes 
without  a  complete  re-write  of  them. 


1.  INTRODUCTION 

The  advent  of  supercomputing  over  the  last  decade 
has  led  to  fairly  mature  codes  in  all  core  disciplines 
of  Computational  Physics.  Due  to  their  immediate 
importance  to  industry,  a  large  number  of  public- 
domain  and  commercial  codes  exists  for  the  simula¬ 
tion  of  Computational  Fluid  Dynamics  (CFD),  Com¬ 
putational  Structural  Dynamics  (CSD)  and  Compu¬ 
tational  Thermo- Dynamics  (CTD)  [Cod98].  There 
exist  large  classes  of  important  engineering  problems 
that  require  the  concurrent  application  of  CFD,  CSD 
and  CTD  techniques.  Some  examples  are: 

-  Explosive  inflation,  e.g.  for  airbags,  where  flow- 
field,  fabric  loading  and  fabric  temperature  are 
closely  linked; 

-  Hypersonic  Flight,  where  the  deformation  of 
the  structure  due  to  aerodynamic  and  aerother- 
mal  loads  is  such  that  a  significant  variation  of 
the  flowfield  takes  place  (shock  location,  surface 
heating,  etc.),  and 

-  Process  modeling,  where  the  heat  due  to  shear 
and  chemical  reactions  in  the  fluid  gives  rise  to 
non-uniform  temperature  distributions  in  the 
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solid,  resulting  in  thermal  stress  loads  and  the 
associated  fatigue. 

-  Variable  Geometry  Vehicles,  where  the  change 
of  geometry  implies  a  transient  phase  in  which 
structures  and  flowfields  are  interacting  strongly, 
and,  in  most  cases,  non-linearly. 

Most  of  these  problems  are  presently  solved  either 
iteratively,  i.e.  making  several  cycles  of  ‘CFD  run 
followed  by  CTD  run  followed  by  CSD  run’,  or  by 
assuming  that  the  CFD,  CSD  and  CTD  problems 
can  be  decoupled  ‘to  first  order’.  Throughout  most 
large  manufacturing  companies  the  respective  CFD, 
CSD  and  CTD  runs  are  performed  in  different  divi¬ 
sions  which  may  be  geographically  dispersed,  leading 
to  time-delays,  loss  of  information,  and,  most  impor¬ 
tantly,  loss  of  insight. 

The  need  to  solve  fluid-structure-thermal  interaction 
problems  has  prompted  a  number  of  developments  in 
this  field  in  recent  years.  The  best  way  to  sort  these 
efforts  is  by  classifying  them  according  to  the  physical 
and  numerical  complexity  employed  for  the  fluid  and 
structure  respectively  (se4e  Figure  1).  For  the  fluid, 
the  PDEs  solved  are,  in  increasing  order  of  physical 
complexity: 

FI.  Laplace/Helmholtz  Operators  (inviscid,  irrota- 
tional,  isentropic  flow), 
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F2.  Non-Linear  Laplace  Operators  (inviscid,  irrota- 
tional  flow), 

F3.  Euler  Equations  (inviscid  flow), 

F4.  Reynolds-Averaged  Navier-Stokes  Equations 
(viscous,  time-averaged  flow), 

F5.  Large-Eddy  Simulations  (viscous  flow  with 
spatio-temporal  cut-off),  and 
F6.  Navier-Stokes  Equations. 

Each  of  these  approximations  requires  between  one 
and  two  orders  of  magnitude  more  CPU-time  and 
memory  than  the  preceding  one.  For  the  linear 
case,  boundary  element  methods  may  be  employed, 
whereas  all  other  approximations  are  typically  ap¬ 
proximated  on  a  grid  with  spatial  discretizations  ob¬ 
tained  from  Finite  Difference,  Finite  Volume,  Finite 
Element,  or  Spectral  Element  techniques. 

For  the  structure,  the  PDEs  solved  are,  in  increasing 
order  of  physical  complexity: 

51.  6  Degrees  of  Freedom  Integration  (rigid  body), 

52.  Linear  Elastic  Models,  either  through 

a)  A  Modal  Decomposition,  or 

b)  A  Finite  Element  Discretization, 

53.  Elasto-Plastic  Models,  and 

54.  Elasto-Plastic  Models  with  Contact,  Rupture, 
etc. 

As  before,  each  of  these  approximations  requires  be¬ 
tween  one  and  two  orders  of  magnitude  more  CPU¬ 
time  and  memory  than  the  preceding  one.  For  struc¬ 
tures,  the  spatial  discretization  is  typically  carried  out 
using  Finite  Element  techniques  [Zie88]. 

For  the  thermodynamic  description  of  the  problem, 
the  PDEs  solved  are,  in  increasing  order  of  physical 
complexity: 

Tl.  Adiabatic  or  Isothermal  Walls  and  Sources/ 
Sinks, 

T2.  Linear  Heat  Conduction,  either  through 

a)  A  Network  Decomposition  [Gas87]  or 

b)  A  Finite  Element  Discretization  [Zie88, 
Pro92,  Loh94],  and 

T3.  Nonlinear  Heat  Conduction  [Lbh94]. 

As  before,  each  of  these  approximations  requires  be¬ 
tween  one  and  two  orders  of  magnitude  more  CPU¬ 
time  and  memory  than  the  preceding  one.  When  solv¬ 
ing  the  PDE’s  describing  the  heat  flux,  the  spatial  dis¬ 
cretization  is  usually  carried  out  using  Finite  Element 
techniques  [Zie88]. 

A  major  characteristic  of  fluid-structure-thermal  al¬ 
gorithms  is  the  requirement  to  combine  the  discretiza¬ 
tions  for  the  fluid  and  the  structure.  This  provides  a 
fourth  classification  item  (see  Figure  2): 

11.  Same  surface/volume  discretization; 

12.  Different  surface/ volume  discretization  coupled 
via: 

a)  Interpolation, 

b)  Least-Squares, 

c)  Lagrange  Multipliers,  or 


d)  A  Third,  so-called  ‘Virtual’  Surface  Grid. 
For  the  simple  CSD  approximations  SI,S2a,  there  is 
no  discretization  of  the  structure  per  se,  so  that  the 
transfer  of  information  between  fluid  and  structure  is 
straightforward. 

With  this  series  of  possibilities,  we  are  now  in  a  po¬ 
sition  to  classify  previous  fluid-structure-thermal  in¬ 
teraction  work.  The  three  classic  fields  that  com¬ 
bine  two  of  these  disciplines:  structural  acoustics 
and  aeroelasticity  for  fluid-structure  interaction  and 
thermal  stress  analysis  for  structures  have  seen  the 
largest  amount  of  activity,  particularly  in  those  in¬ 
stances  where  the  fluid,  structure  and  heat  transfer 
were  assumed  as  linear  (inviscid,  irrotational,  isen- 
tropic  fluid,  linear  elastic  structure,  linear  heat  con¬ 
duction).  Of  the  many  references  in  aeroelastics,  we 
mention  [Jac87,  Bat88,  Gur90,  Eve90,  Eve91,  Bos93, 
Rau93,  Fel93,  Gur93,  Alo95],  For  thermoelastics,  see 
[Tho88,  Tam97].  To  our  knowledge,  the  earliest  ef¬ 
fort  to  simulate  fully  coupled  fluid/  structure/  ther¬ 
mal  problem  is  the  work  of  Thornton  [Tho88],  which 
was  carried  out  in  2-D. 

The  present  effort  is  directed  towards  nonlinear  appli¬ 
cations,  in  particular  structures  that  undergo  defer-, 
mations  due  to  aerodynamic  or  aero-thermodynamic 
loads.  For  this  reason,  we  start  immediately  with  the 
Euler  and  Reynolds- Averaged  Navier-Stokes  equa¬ 
tions  for  the  fluid,  and  either  the  linear  elastic 
(for  aerospace  flight  vehicles)  or  nonlinear,  large- 
deformation  equations  (for  impact  simulations)  for 
the  structure.  Given  that  the  geometrical  complex¬ 
ity  of  the  problems  targeted  for  simulation  can  be 
severe  and  the  deformation  considerable,  automatic 
grid  generation  is  a  prime  requirement.  For  this  rea¬ 
son,  unstructured  grids  are  employed  for  the  fluid  and 
the  structures.  The  elements  used  for  the  fluid  are 
tetrahedral,  whereas  the  elements  for  the  structure 
are  typically  trusses,  beams,  quads  or  bricks. 

The  remainder  of  the  paper  is  organized  as  follows: 
Section  2  describes  the  coupling  strategy  used.  The 
individual  codes  chosen,  FEFL098  for  the  fluid, 
COSMIC-NASTRAN  ,  DYNA3D  for  the  solid 
region,  and  COSMIC-NASTRAN  for  the  heat 
transfer  in  solids,  are  briefly  described  in  Section  3. 
Sections  4-6  discuss  surface  to  surface  interpolation, 
surface  tracking  and  load  transfer  techniques.  In  Sec¬ 
tion  7,  some  demonstration  runs  are  shown.  Finally, 
conclusions  and  an  outlook  for  future  development  are 
given  in  Section  8. 

2.  COUPLING  ALGORITHM 

When  trying  to  compare  the  possible  coupling  algo¬ 
rithms,  it  is  useful  to  start  from  the  basic  discrete 
equation  systems  obtained  for  the  thermal,  solid  and 
fluid  regions.  In  the  following,  we  assume,  without 
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loss  of  generality,  that  some  form  of  spatial  and  tem¬ 
poral  discretization  has  been  carried  out  for  the  in¬ 
dividual  sub-disciplines  and  subdomains.  The  time¬ 
marching  scheme  is  then  cast  into  a  system  of  equa¬ 
tions  for  the  increments  of  the  unknowns  in  time,  i.e. 
in  so-called  A-form.  For  the  thermal  field  in  the 
solid  region,  we  obtain  a  system  of  equations  of  the 
form: 


LHST  AT  = 


(sc+9K) 


•  AT  =  fj  +  f* 


.  (1) 


where 

At,C,T,  K,f  denote,  respectively,  the  timestep,  the 
heat  capacitance  matrix,  nodal  temperature  vector, 
heat  conduction  matrix  and  the  (internal  and  exter¬ 
nal)  thermal  loads  vector,  and  we  have  lumped  the 
left-hand-side  matrix  into  the  expression  LHST.  By 
splitting  the  degrees  of  freedom  into  those  touching 
the  fluid  region  ('/’),  and  the  remaining  ones  (‘t’),  we 
obtain 


f  LHST//  LHST/t  1  /AT/)  _ 

\  LHST tj  LHST,,  J  \AT,  /  — 

(»'♦(*)*♦  (L-(*i“*))  ■ 

where  the  superscripts  i,e  denote  internal  (conduc¬ 
tion)  and  external  thermal  loads  respectively,  L  is 
the  load  matrix  and  q/  the  vector  of  heat  loads  on 
the  surface  due  to  the  fluid.  For  the  solid  region, 
we  obtain  a  system  of  equations  of  the  form: 


LHSS  •  AX  =  (aM,  +  /?D  +  7K)  AX  = 

f;+f/  +  0Af,e  +  Q(T  +  0AT)  ,  (3) 

where  M,,f*,f/,D,K,X,X,Q,T  denote,  respec- 
tively,  the  mass-matrix,  interal  (stiffness,  damping, 
inertia)  and  external  force  vector,  damping  matrix, 
stiffness  matrix,  displacement  vector,  velocity  vec¬ 
tor,  thermal  stress  matrix  and  nodal  temperatures, 
a,  P,  7, 0  depend  on  the  timestep  and  the  timestep¬ 
ping  scheme  chosen,  and  we  have  lumped  the  left- 
hand-side  matrix  into  the  expression  LHSS.  By  split¬ 
ting  the  degrees  of  freedom  into  those  touching  the 
fluid  region  (*/’),  and  the  remaining  ones  (‘s’),  we 
obtain 


f  LHSS//  LHSS/,  1  (AXf\_ 
\  LHSS,/  LHSS,,  /  UxJ- 


t)‘+(fO'+(L(n6Asj))+ 


(  Q/(T  +  ©AT)\ 
VQ,(T  +  0AT)  J 


(4) 


where  the  superscripts  i,  e  denote  internal  (stiffness, 
damping,  inertia)  and  external  forces  respectively,  L 
is  the  load  matrix  and  s /  the  fluid  stresses  (pressures, 
shear  stresses)  on  the  surface.  For  the  fluid  region, 
we  obtain  a  system  of  equations  of  the  form: 


LHSF  AU  =  (^M/+0/j)  AU  =  r'  +  fe  ,  (5) 

where  M/,  J,f,  U  denote,  respectively,  the  mass  ma¬ 
trix,  Jacobian  of  the  discretized  fluxes,  internal  and 
external  forces,  and  we  have  lumped  the  left-hand- 
side  matrix  into  the  expression  LHSF.  By  splitting 
the  degrees  of  freedom  into  those  touching  the  solid 
region  (V),  and  the  remaining  ones  (*/’),  we  obtain 

f  LHSF,,  LHSF,/  1  /  AU,  \ 

\  LHSF/,  LHSF// J  VAU//  ~ 


where  the  superscripts  i,  e  denote  the  internal  (vis¬ 
cous,  advective,  inertia)  and  external  force  vectors  re¬ 
spectively.  Before  proceeding  to  the  complete  coupled 
equation  system,  we  have  to  define  the  continuity  of 
the  variables  across  boundaries  or  domains.  The  tem¬ 
peratures  required  by  the  structure  for  thermal  stress 
calculation  are  obtained  from  the  CTD  solver  via  in¬ 
terpolation: 


T,  =  I,<T*  ,  (7) 

where  I,t  is  a  3-D  interpolation  matrix,  which  reverts 
back  to  the  identity  matrix  in  case  that  the  nodes 
of  the  grids  for  structural  and  thermal  analysis  are 
coincident.  In  the  same  way,  the  temperatures  at  the 
‘wetted  surfaces’  of  the  CFD  and  CTD  domains  have 
to  be  the  same: 


T/  =  1/tTt  .  (8) 

Here  I/*  is  now  a  surface  to  surface  interpolation  op¬ 
erator.  The  velocities  at  the  surface  of  the  structure 
and  fluid  domain  are  required  to  be  the  same,  hence: 

v/lr,  =  I/*v,  =  I/5X,  .  (9) 

Finally,  the  thermal  and  mechanical  loads  the  fluid 
exerts  on  the  structure  have  to  be  transferred: 

4 

q/ =  Gt/U/ +  Gt.U,  ,  8/  =  G,/U/  +  G„U,  , 

(10) 

where  we  have  used  G  to  indicate  the  gradient-  or 
derivative-based  operators  involved. 
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The  final  coupled  system  then  takes  the  form: 


'  LHST/;  LHST/t  0  0  -0LG(3 
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Q,I/t  0  0  0 
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X, 
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(11) 


Given  this  complete  system,  we  can  now  define  pos¬ 
sible  coupling  algorithms. 

a)  Tight  coupling:  We  denote  by  tight  coupling  the 
simultaneous  update  of  all  variables,  including  (and 
most  notably)  those  at  the  fluid/structure/thermal 
interface.  This  implies  solving  the  complete  system 
given  by  Eqn.(ll)  in  one  step.  The  formulation  al¬ 
lows  for  different  grids  in  the  CFD,  CSD  and  CTD  do¬ 
mains,  but  the  reader  should  realize  that  the  deriva¬ 
tion  of  the  proper  projection  and  interpolation  ma¬ 
trices  can  be  tedious  in  3-D.  From  a  numerical  point 
of  view,  choosing  this  approach  leads,  for  most  cases, 
to  an  increase  in  the  condition  number  for  the  ma¬ 
trices  to  be  solved,  with  the  associated  solution  dif¬ 
ficulties.  From  a  practical  point  of  view,  choosing 
this  approach  requires  an  almost  complete  re-write  of 
the  CFD,  CSD  and  CTD  codes  into  one  single  cou¬ 
pled  code.  This  implies  a  loss  of  modularity  (different 
numerical  schemes,  different  codes),  as  well  as  the  in¬ 
ability  to  couple  several  CFD,  CSD  and  CTD  codes. 
Moreover,  the  ‘trade-oriented’  aspect  of  each  of  the 
individual  codes  is  blurred  or  lost,  with  the  associ¬ 
ated  extra  expenses  for  retraining  the  user  base. 

b)  Loose  coupling:  We  denote  by  loose  coupling  the 
separate  update  of  the  CFD,  CSD  and  CTD  do¬ 
mains,  with  a  transfer  of  variables  at  the  interface 
or  the  domain.  The  most  common  way  of  realiz¬ 
ing  this  approach  is  by  selecting  a  ‘master  surface’ 
for  a  certain  variable,  and  interpolating  or  projecting 
the  variable  to  the  other  domain  at  the  beginning  of 
the  next  timestep.  For  CFD/CSD/CTD  problems, 
the  most  natural  combination  is  to  select  the  CSD 
surface  location  and  velocity  as  the  ‘master-grid’  for 
displacements  and  temperatures,  and  the  CFD  grid 
as  the  ‘master-grid’  for  the  loads  (pressures,  shear- 
stresses,  heat  fluxes).  The  product  of  displacement 
times  load  yields  work,  making  the  combination  phys¬ 
ically  appealing.  This  approach  may  be  regarded  as 


an  iterative  solution  of  the  combined  system  given  by 
Eqn.(ll).  Each  iterative  pass  is  composed  of  the  fol¬ 
lowing  steps  (see  Figure  3): 

-  Solve  for  CTD  with  imposed  q/; 

-  Solve  for  CSD  with  imposed  s/,T; 

-  Solve  for  CFD  with  imposed  T,,X,,X,. 

This  procedure  may  be  refined  to  include  a  transfer 
of  the  left-hand  side  Jacobians. 

Depending  on  the  time  integration  scheme  used  for 
the  CTD,  CSD  and  CFD  domains,  several  simplifying 
strategies  can  be  employed.  Should  explicit  time  inte¬ 
gration  be  appropriate  to  advance  the  CTD,  CSD  and 
CFD  regions  (as  is  the  case  for  some  of  the  problems 
considered  here),  the  loose  and  tight  coupling  systems 
are  almost  identical,  the  only  error  being  the  mass 
of  fluid  for  the  first  row  of  elements  adjacent  to  the 
solid.  Should  implicit  time  integration  be  best  way 
to  advance  the  CSD  and  CFD  regions  (as  is  the  case 
for  low-frequency  aeroelastic  applications),  the  LHS 
of  the  time-discrete  form  of  Eqn.(ll)  will  contain  en¬ 
tries  of  the  Jacobians  of  f‘.  In  this  case,  the  iterative 
strategy  discussed  above  will  have  to  be  used  for  the 
loose  coupling  approach  if  equivalency  with  the  tight 
coupling  system  is  to  be  achieved.  Finally,  if  only  a 
steady-state  solution  for  the  coupled  fluid-structure 
system  is  sought,  the  loose  coupling  approach  may  be 
used  either  with  explicit  or  implicit  time  integration 
for  the  CTD,  CSD  and  CFD  domains  without  incur¬ 
ring  any  errors. 

The  variables  on  the  boundaries  are  transferred  back 
and  forth  between  the  different  codes  by  a  master 
code  that  directs  the  multi-disciplinary  run.  Two  pos¬ 
sible  implementations  are  possible  here: 
a)  Single  Executable:  Each  ‘discipline  code’  (CFD, 
CSD,  CTD,  ..)  is  seen  as  a  subroutine  or  object 
called  by  the  master  code 
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b)  Multiple  Executables:  Each  ‘discipline  code7 
(CFD,  CSD,  CTD,  ..)  runs  as  an  independent 
process  that  awaits  commands  and  data  from  a 
master  process  or  other  processes.  The  master 
process  synchronizes  the  start  and  interruption 
of  the  different  processes,  and  in  particular  the 
transfer  of  information. 

In  both  cases,  the  transfer  of  geometrical  and  physical 
information  is  performed  between  the  different  codes 
without  affecting  their  layout,  basic  functionality,  and 
coding  styles.  This  is  seen  as  the  main  advantage  of 
this  approach. 

A  tremendous  amount  of  man-years  has  been  de¬ 
voted  to  CFD,  CSD  and  CTD  codes,  incorporating 
into  them  all  the  minor  features  that  make  these 
codes  efficient,  practical,  user-friendly  tools.  The 
central  assumption  made  here  is  that  these  codes 
will  not  be  rewritten,  should  not  be  hampered  in 
their  present  and  future  development,  and  neverthe¬ 
less  can  be  combined  efficiently  to  solve  strongly  cou¬ 
pled  CFD/CSD/CTD  problems. 

For  structures  that  break,  rupture,  or  deform 
markedly  due  to  the  loads  exerted  by  the  fluid,  the 
corresponding  CFD  and/or  CSD  grid  will  require 
some  form  of  remeshing,  which  can  be  either  local 
or  global  in  nature.  If  this  remeshing  can  not  be  done 
automatically,  the  usefulness  of  such  an  approach  will 
always  remain  limited.  Therefore,  automatic  gridding 
techniques  are  an  enabling  technology  for  this  class  of 
problems.  The  CFD  code  employed  here  has,  as  one  of 
its  salient  features,  an  automatic  remeshing  capabil¬ 
ity.  This  capability  is  very  important  for  the  class  of 
fluid-structure  interaction  problems  considered,  and 
will  be  demonstrated  in  the  examples  shown  below. 

3.  CODES  SELECTED 

The  selection  of  the  respective  CFD/CSD/CTD  codes 
was  made  according  to  the  following  guidelines: 

-  The  code  must  be  well  proven; 

-  The  code  must  be  benchmarked; 

-  The  code  must  be  supported; 

-  The  code  must  have  a  user  base/community; 

The  three  candidate  codes  chosen  were:  FE- 
FL098  for  the  fluid,  COSMIC-NASTRAN  for 
linear  structures  DYNA3D  for  the  solid.  A  brief 
overview  of  the  physics  being  modelled,  the  numeri¬ 
cal  techniques  employed,  as  well  as  useful  engineering, 
meshing  and  software  options  available  in  these  two 
codes  is  given  in  the  sequel. 

3.1  CFD  CODE:  FEFLQ98 

a)  Physics:  FEFL098  is  a  simulation  code  for  com¬ 
pressible  and  incompressible  flows.  The  equations 
solved  are  the  Euler,  Laminar  or  Reynolds-averaged 


Navier-Stokes  equations,  as  well  as  the  linear  acous¬ 
tics  equations.  The  turbulence  models  available  are 
the  Smagorinsky,  Baldwin-Lomax  and  k  —  e  models, 
as  well  as  a  user-input  option  via  subroutine.  Equa¬ 
tions  of  state  supported  by  FEFL098  include  ideal 
polytropic  gas,  real  air  EOS  table  look-up,  water  EOS 
table  look-up,  the  JWL  EOS  for  high  explosives  and  a 
link  to  the  general  SESAME  library  of  EOS.  In  order 
to  handle  situations  with  moving  bodies  and/or  mov¬ 
ing  grids,  the  equations  are  solved  in  the  Arbitrary 
Lagrangean-Eulerian  frame  [Don82]. 

Flows  with  particles  are  treated  via  a  second  solid 
phase.  The  particles  interact  with  the  fluid,  exchang¬ 
ing  mass,  momentum  and  energy,  and  are  integrated 
in  a  time-consistent  manner  with  the  fluid. 

b)  Numerics:  The  spatial  discretization  is  accom¬ 
plished  via  finite  element  techniques  on  unstructured 
tetrahedral  grids.  In  order  to  achieve  high  execu¬ 
tion  speeds,  edge-based  data  structures  are  used. 
Both  central  and  upwind  flux  (van  Leer  [vLe74], 
Roe  [Roe81],  AUSM-f  [Lio95])  formulations  are 
used.  For  the  temporal  discretization,  both  Taylor- 
Galerkin  and  Runge-Kutta  time  integration  schemes 
are  possible.  Monotonicity  of  the  solution  may  be 
achieved  through  a  blend  of  second  and  fourth  order 
dissipation  [Jam93],  pressure-based,  Flux-Corrected 
Transport  (FCT)  [Loh87],  or  classic  TVD  limitors. 
The  particles  are  integrated  using  a  second-order 
Runge-Kutta  scheme,  and  optimal  tracking  tech¬ 
niques  [Loh90,  Siv94]  have  been  implemented  to  ex¬ 
pedite  the  transfer  of  information  between  fields  and 
particles. 

c)  Engineering:  In  order  to  handle  situations  with 
moving  bodies,  FEFL098  offers  a  variety  of  options: 
prescribed  motion,  6-DOF  integration  based  on  aero¬ 
dynamic  forces,  and  link  to  CSD  codes. 

A  variety  of  boundary  conditions  can  be  prescribed 
to  simulate  as  faithfully  as  possible  engineering  flows: 
sub-,tran-,  and  supersonic  in/outflow,  total  pressure 
inflow  b.c.,  static  pressure,  mach- number  and  normal 
flux  outflow  b.c.,  porous  walls,  and  periodicity.  At  the 
same  time,  a  large  variety  of  diagnostics  is  produced 
by  the  code  to  track  or  display  specific  parts  of  the 
flowfield  that  are  of  special  interest:  0-D  probes  (e.g. 
for  station  time  history),  1-D  line  segments  for  x/y 
display,  2-D  planes  or  iso-surfaces  for  contouring,  flux 
trough  surfaces,  force  and  moment  data  on  surfaces 
or  bodies,  on-line  display  of  the  flowfield,  etc. 

d)  Meshing  Options:  FEFL098  allows  for  automatic 
adaptive  h-refinement  [Loh92]  and  automatic  remesh¬ 
ing  [Loh88,90]  in  order  to  enhance  the  solution  accu¬ 
racy,  even  for  situations  with  moving  bodies. 

e)  Software:  FEFL098  is  written  in  FORTRAN- 
77  and  fully  vectorized.  Renumbering  tech¬ 
niques  [Loh93,97]  are  used  extensively  in  order  to 
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avoid  cache-misses  on  RISC-based  machines  and 
cache-line  contingencies  on  shared  memory  parallel 
machines.  For  distributed  memory  parallel  machines 
domain  splitting  [Loh93]  and  message  passing  are  em¬ 
ployed.  The  code  runs  on  all  major  workstations, 
vector-supercomputers  and  parallel  platforms. 

FEFL098  is  a  well-proven  and  benchmarked  code 
used  extensively  by  the  authors  and  others  in  the  CFD 
community  [Bau93,94)95,96,98]. 

3.2  Linear  CSD  CODE:  COSMIC-NASTRAN 

a)  Physics:  COSMIC-NASTRAN  is  a  simu¬ 
lation  code  especially  suited  for  solids  undergo¬ 
ing  elastic  deformations.  In  addition,  COSMIC- 
NASTRAN  may  be  used  to  solve  any  linear  second 
order  elliptic  or  parabolic  PDE,  e.g.  Laplace,  Poisson 
and  Helmholtz  equations,  time-dependent  Hear  heat 
conductions,  etc.  The  conservation  equations  for  mo¬ 
mentum  are  written  and  solved  for  in  the  Lagrangian 
frame  of  reference.  The  small  deformation,  small 
strain  formulation  is  employed  throughout.  The  code 
incorporates  isotropic  or  orthotropic  elastic  material 
models,  and  many  variations  for  composites  [Mac81]. 

b)  Numerics:  The  spatial  discretization  is  accom¬ 
plished  via  finite  element  techniques  on  unstructured 
grids.  A  very  large  number  of  different  element 
types  is  supported  (it  is  rumored  that  COSMIC- 
NASTRAN  is  over  two  million  lines  long  !),  among 
them  truss  beam  elements,  several  triangular  and 
quadrilateral  shell  elements,  and  tetrahedral,  pris¬ 
matic  and  hexaheral  solid  elements.  The  temporal 
discretization  is  carried  out  using  an  implicit  differ¬ 
ence  method,  which  is  unconditionally  stable.  The 
resulting  equation  systems  for  static,  dynamic  and 
eigenvalue  analysis  are  solved  directly,  using  vari¬ 
ants  of  Gaussian  solution  techniques.  COSMIC- 
NASTRAN  incorporates  elaborate  bandwidth  min¬ 
imization  algorithms  to  reduce  the  cost  of  direct 
solvers. 

c)  Engineering:  Given  its  large  history  as  the  de 
facto  standard  code  for  linear  structural  mechanics, 
COSMIC-NASTRAN  incorporates  a  very  large 
number  of  convenient  features  that  have  proven  useful 
for  the  efficient  modeling  of  engineering  problems. 

A  large  variety  of  diagnostics  is  produced  by  the  code 
to  track  or  display  specific  parts  of  the  structure  that 
are  of  special  interest:  0-D  probes  (e.g.  for  station 
time  history),  1-D  line  segments  for  x/y  display,  2- 
D  planes  or  iso-surfaces  for  contouring,  stress,  strain, 
force  and  moment  data  on  surfaces  or  fields,  etc. 

d)  Meshing  Options:  COSMIC-NASTRAN  in  its 
current  state  does  not  allow  for  automatic  adaptive 
h-refinement  or  automatic  remeshing. 


e)  Software:  COSMIC-NASTRAN  is  written  in 
FORTRAN-66.  The  code  was  written  before  the  ad¬ 
vent  of  vector  or  parallel  machines  CRAY-type  ma¬ 
chines.  Given  that  the  number  of  degrees  for  linear 
structures  are  typically  only  a  fraction  of  those  used 
for  flow  simulations,  this  is  not  a  major  drawback. 
COSMIC-NASTRAN  has  been  ported  to  all  ma¬ 
jor  platoforms,  from  PC’s  to  parallel  platforms. 

COSMIC-NASTRAN  is  a  well-proven,  bench- 
marked  code  used  extensively  in  industry,  government 
laboratories  and  academia. 

3.3  Non-Linear  CSD  CODE:  DYNA3D 

a)  Physics:  DYNA3D  is  a  simulation  code  especially 
suited  for  solids  undergoing  rapid  and  severe  defor¬ 
mation.  The  conservation  equations  for  momentum 
are  written  and  solved  for  in  the  Lagrangian  frame  of 
reference.  The  large  deformation,  large  strain  formu¬ 
lation  is  employed  throughout.  The  code  incorporates 
forty-one  different  material  models,  among  them  lin¬ 
ear  elastic,  linear  elastic-plastic,  strain-rate  sensitive 
steel  with  fracture,  hardening  material  models,  a  ge¬ 
ological  cap  model  for  soil  materials,  and  a  variety 
of  concrete  models  that  simulate  fracturing  behav¬ 
ior  [Whi93].  DYNA3D  also  offers  eleven  equations 
of  state  models,  including  equations  of  state  for  high 
explosives. 

b)  Numerics:  The  spatial  discretization  is  accom¬ 
plished  via  finite  element  techniques  on  unstructured 
grids.  The  elements  available  for  structural  mod¬ 
elling  are  one  truss  and  two  beam  elements,  sev¬ 
eral  quadrilateral  shell  elements  (e.g.  Belytschko- 
Tsai  [Bel84],  Hughes-Liu  [Hug81],  YASE  [Whi93], 
and  QPH  [Bel94],  and  hexaheral  elements  with  one- 
point  integration  for  the  3-D  solids.  The  shells  al¬ 
low  for  multiple  integration  points  across  the  thick¬ 
ness,  making  it  possible  to  accurately  treat  nonlin¬ 
ear  plastic  behavior  of  simple  and  composite  shells. 
Several  hourglass  control  options  may  be  used.  We 
have  found  that  the  Flanagan-Belytschko  hourglass 
control  [Fla81]  works  best  for  the  unstructured  hex- 
ahedral  grids  we  most  often  employ.  The  temporal 
discretization  is  carried  out  using  an  explicit  central 
difference  method,  which  is  conditionally  stable. 

c)  Engineering:  DYNA3D  incorporates  a  large  num¬ 
ber  of  convenient  features  that  prove  especially  useful 
for  realistic  engineering  problems.  The  following  is  a 
non-exhaustive  list  of  those  features  that  were  par¬ 
ticularly  relevant  to  our  class  of  applications.  The 
user  may  prescribe  non-reflecting  boundary  condi¬ 
tions  which  eliminate  stress  wave  reflections  at  model 
boundaries,  making  it  possible  to  use  smaller  mod¬ 
els.  There  are  twelve  types  of  sliding-interface  algo¬ 
rithms  to  treat  different  interface  conditions  between 
interacting  parts.  Sliding-interface  algorithms  permit 
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the  treatment  of  contact  conditions  with  friction,  gap 
opening,  spotwelds,  etc.  For  civil  engineering  applica¬ 
tions,  there  are  rebar-concrete  interaction  algorithms 
which  include  degradation  and  failure  of  bond. 

A  large  variety  of  diagnostics  is  produced  by  the  code 
to  track  or  display  specific  parts  of  the  structure  that 
are  of  special  interest:  0-D  probes  (e.g.  for  station 
time  history),  1-D  line  segments  for  x/y  display,  2- 
D  planes  or  iso-surfaces  for  contouring,  stress,  strain, 
force  and  moment  data  on  surfaces  or  fields,  etc. 

d)  Meshing  Options:  DYNA3D  in  its  current  state 
does  not  allow  for  automatic  adaptive  h-refinement  or 
automatic  remeshing.  Work  is  currently  in  progress 
to  incorporate  these  feature  into  DYNA3D 

e)  Software:  DYNA3D  is  written  in  FORTRAN- 
77  and  fully  vectorized.  The  code  was  written  with 
CRAY-type  machines  in  mind,  but  runs  well  on  all 
major  workstations,  vector-supercomputers  and  some 
parallel  platforms.  It  employs  dynamic  memory  allo¬ 
cation,  making  it  capable  of  solving  very  large  prob¬ 
lems. 

DYNA3D  is  a  well-proven  and  benchmarked  code 
used  extensively  by  the  authors  and  others  in  the  CSD 
community  [Gou82,  Cha82,  Whi93].  DYNA3D  was 
developed  at  the  Lawrence  Livermore  National  Lab¬ 
oratories  by  Dr.  John  Hallquist  with  contributions 
from  Dr.  David  Benson  and  Dr.  Robert  Whirley. 
DYNA3D  has  been  successfully  used  for  a  large 
number  of  applications,  including  nuclear  and  con¬ 
ventional  weapon  design,  car  and  airplane  crashwor¬ 
thiness  studies,  analysis  of  reinforced  structures  such 
as  bunkers,  tunnels,  and  silos,  as  well  as  spent  nu¬ 
clear  shipping  cases.  It  is  supported  and  maintained 
by  Lawrence  Livermore  National  Laboratories. 

4.  SURFACE  TO  SURFACE  INTERPOLATION 

One  of  the  main  aims  of  the  proposed  approach  is 
to  couple  the  different  codes  in  such  a  way  that  each 
one  of  the  codes  used  is  modified  in  the  least  possible 
way.  Moreover,  the  option  of  having  different  grids 
for  different  disciplines  (CFD/CSD/CTD/CEM..),  as 
well  as  adaptive  grids  that  vary  in  time,  implies  that 
in  most  cases  no  fixed  common  variables  will  exist  at 
the  boundaries.  Therefore,  fast  and  accurate  interpo¬ 
lation  techniques  are  required.  As  the  grids  may  be 
refined/coarsened  during  timesteps,  and  the  surface 
deformations  may  be  severe,  the  interpolation  proce¬ 
dures  have  to  combine  speed  with  generality.  Algo¬ 
rithmic  aspects  of  procedures  that  fulfill  these  require¬ 
ments  have  been  discussed  in  [Knu73,  Sed83,  Loh95] 
(volume)  and  [Loh95,  Mam95,  Ceb97]  (surface),  and 
do  not  need  to  be  repeated  here. 


4.1  Unwrapping  of  Doubly  Defined  Faces 

Consider  the  common  case  of  thin  structural  ele¬ 
ments,  e.g.  roofs,  walls,  stiffeners,  etc.  surrounded 
by  a  fluid  medium.  The  structural  elements  will  be 
discretized  using  shell  elements.  These  shell  elements 
will  be  affected  by  loads  from  both  sides.  Most  CSD 
codes  require  a  list  of  faces  on  which  loads  are  exerted. 
This  implies  that  the  shell  elements  loaded  from  both 
sides  will  appear  twice  in  this  list.  In  order  to  be  able 
to  incorporate  thickness  and  interpolate  between  CSD 
and  CFD  surface  grids  in  a  unique  way,  these  doubly 
defined  faces  are  identified,  and  new  points  are  intro¬ 
duced.  The  first  step  is  to  identify  the  doubly  defined 
faces.  A  linked  list  that  stores  the  faces  surrounding 
each  point  is  constructed.  Each  face  is  then  checked 
by  performing  an  exhaustive  comparison  of  the  points 
of  each  of  the  faces  surrounding  the  first  node  of  each 
face.  This  will  identify  doubly  defined  faces  in  O(Nj) 
complexity  , where  Nf  is  the  number  of  faces.  Should 
this  check  reveal  the  existence  of  doubly  defined  faces, 
new  points  are  introduced  using  an  unwrapping  pro¬ 
cedure  [Loh95]. 

5.  SURFACE  TRACKING  TECHNIQUES 

An  important  question  that  needs  to  be  addressed  is 
how  to  make  the  different  grids  follow  one  another 
when  deforming  surfaces  are  present.  Consider  the 
typical  aeroelastic  case  of  a  wing  deforming  under 
aerodynamic  loads.  For  accuracy  purposes,  the  CFD 
discretization  will  be  fine  on  the  surface,  and  the  sur¬ 
face  will  be  modelled  as  accurately  as  possible  from 
the  CAD-CAM  data  at  the  start  of  the  simulation. 
On  the  other  hand,  a  CSD  discretization  that  models 
the  wing  as  a  series  of  plates  may  be  entirely  appropri¬ 
ate.  If  one  would  force  the  CFD  surface  to  follow  the 
CSD  surface,  the  result  would  be  a  wing  with  no  thick¬ 
ness,  clearly  inappropriate  for  an  acceptable  CFD  re¬ 
sult.  On  the  other  hand,  for  strong  shock/object  in¬ 
teractions  with  large  plastic  deformations  and  possi¬ 
ble  tearing,  forcing  the  CFD  surface  to  follow  exactly 
the  CSD  surface  is  the  correct  way  to  proceed.  These 
two  examples  indicate  that  more  than  one  strategy 
may  have  to  be  used  to  interpolate  and  move  the  sur¬ 
face  of  the  CFD  mesh  as  the  structure  moves.  We 
have  incorporated  the  following  techniques: 

a)  Exact  tracking  with  linear  interpolation.  This  is 
the  most  straightforward  case,  but,  as  could  be  seen 
from  the  example  described  above,  may  lead  to  bad 
results. 

b)  Exact  tracking  with  quadratic  interpolation.  In 
this  case,  the  surface  normals  are  recovered  at  the 
end-points  of  the  surface  triangulation.  For  each 
edge  of  the  triangulation,  the  midpoint  is  extrapo¬ 
lated  using  a  Hermitian  polynomial  [Loh95].  In  this 
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way,  quadratic  triangles  are  obtained.  The  surface 
is  then  approximated/interpolated  using  this  higher 
order  surface. 

c)  Tracking  with  initial  distance  vector.  In  many  in¬ 
stances,  e.g.  thick  shells,  the  CFD  and  CSD  domains 
will  never  coincide.  A  way  to  circumvent  this  dilemma 
is  to  compute  the  difference  vector  between  the  initial 
CSD  and  CFD  surfaces,  and  maintain  this  vector  (al¬ 
lowing  for  translation  and  rotation)  for  the  duration 
of  the  coupled  run.  Several  options  are  possible  here, 
e.g.  surface  normals,  point  normals,  etc.  [Ceb97], 

An  important  area  still  under  investigation  is  how  to 
handle,  in  an  efficient  and  automatic  way,  models  that 
exhibit  incompatible  dimensionalities.  An  example 
for  such  a  ‘reduced  model’  would  be  an  aerothermoe- 
lastic  problem  where  the  wing  structure  is  modeled  by 
a  torsional  beam  (perfectly  acceptable  for  the  lowest 
eigenmodes),  the  fluid  by  a  3-D  volumetric  mesh,  and 
the  heat  transfer  by  a  network  model.  It  is  easy  to 
see  that  the  proper  specification  of  movement  for  the 
CFD  surface  based  on  the  1-D  beam,  as  well  as  the 
load  transfer  from  the  fluid  to  the  beam  and  network 
points  represent  non-trivial  problems  for  a  general, 
user-friendly  computing  environment. 

6.  CFD-CSD/CTD  LOAD  TRANSFER 

During  each  iteration,  the  CFD  loads  have  to  be 
transferred  to  the  CTD/CSD  grids.  Simple  point- 
wise  interpolation  can  be  employed  for  those  cases  in 
which  the  elements  of  the  CTD/CSD  surface  meshes 
are  smaller  or  of  similar  size  than  the  elements  of  the 
CFD  surface  mesh.  However,  this  approach  is  not 
conservative,  and  will  not  yield  accurate  results  for 
the  common  case  of  CTD/CSD  surface  elements  be¬ 
ing  larger  than  their  CFD  counterpart.  Considering 
without  loss  of  generality  the  pressure  loads  only,  it 
is  desirable  to  attain: 

p,(x)ssP/(x)  ,  (12) 

while  being  conservative  in  the  sense  of: 

f  =  J p,ndT  =  J pjndT  ,  (13) 

where  pj,p,  denote  the  pressures  on  the  fluid  and 
solid  material  surfaces,  and  n  is  the  normal  vector. 
These  requirements  may  be  combined  by  employing  a 
weighted  residual  method.  With  the  approximations: 

P>  =  KPi>  ,  PS  =  Njpjj  ,  (14) 

we  have 

J NiNidTPj,  =  j NiN}dTPjJ  ,  (15) 

which  may  be  rewritten  as: 
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Mp,  =  r  =  Lpj  .  (16) 

Here  M  is  a  ‘consistent  mass-matrix’,  and  L  a  ‘load¬ 
ing  matrix’.  The  solution  of  this  coupled  system  of 
equations  is  obtained  iteratively  in  the  now  familiar 
way: 


Mr(p,;+1-pt)=r-M  p*J  ,  (17) 

where  M*  is  the  ‘lumped  mass  matrix’.  Typically, 
three  iterations  are  sufficient  to  achieve  an  accurate 
result.  One  can  also  show  that  Eqn.(16)  is  equivalent 
to  the  least-squares  minimization  of 

1  =  J\p,-pj}2dr,  (18) 

We  remark  that  the  weighted  residual  method  is 
conservative  in  the  sense  of  Eqn.(13).  The  sum 
of  all  shape-functions  at  any  given  point  is  unity 
(52,-  A^x)  =  1),  and  therefore: 


J  p,dT  =  J  NjdTpj,  =  J  £  NiNidTPjs 
=  E/  NiNidVPjs  =  J2j  NiN}dTpjf 
=  J  Y,NiNsdTPif  =  J  NidrPjf  =  J pjd r  (19) 


The  most  problematic  part  of  the  weighted  residual 
method  is  the  evaluation  of  the  integrals  appearing 
on  the  right-hand  side  of  Eqn.(15).  When  the  CFD 
and  CSD  surface  meshes  are  not  nested,  this  is  a 
formidable  task.  The  adaptive  Gaussian  quadrature 
procedure  proposed  in  [Ceb97]  has  proven  very  suc¬ 
cessful  for  the  evaluation  of  these  integrals. 


7.  EXAMPLE  RUNS 

7.1  Generic  Weapon  Fragmentation:  This  second 
example  shows  a  fully  coupled  CFD/CSD  run. 
The  structural  response  was  calculated  using  GA- 
DYNA  [Pel95,  Pel97,  Pel98],  an  enhanced  version 
of  DYNA3D  that  incorporates  adaptive  refinement 
and  improved  contact  algorithms.  The  structural  ele¬ 
ments  were  assumed  to  fail  once  the  average  strain  in 
an  element  exceeded  60%.  At  the  beginning,  the  CFD 
domain  consisted  of  two  separate  regions.  These  re¬ 
gions  connected  as  soon  as  fragmentation  started.  In 
order  to  handle  narrow  g&ps  during  the  break-up  pro¬ 
cess,  the  failed  CSD  elements  were  shrunk  by  a  frac¬ 
tion  of  their  size.  This  alleviated  the  timestep  con¬ 
straints  imposed  by  small  elements  withoutaffecting 
the  overall  accuracy.  The  final  breakup  led  to  approx¬ 
imately  1200  objects  in  the  flowfield.  Figures  4,5  show 
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the  fluid  pressure  and  CSD  surface  velocity  at  three 
different  times  during  the  simulation.  Typical  meshes 
for  this  simulation  were  of  the  order  of  8.0  Mtets,  and 
the  simulations  required  of  the  order  of  50  hours  on 
the  SGI  Origin  2000  running  on  32  processors. 

7.2  Nose-Cone:  As  a  second  example,  we  consider  a 
generic  nose-cone.  The  cone  is  1  m  long  and  its  an¬ 
gle  is  approximately  20°.  The  freestream  conditions 
were  set  as  follows:  density  p  =  1.25  kg/m3,  velocity 
v  =  103  m/sec,  pressure  p  =  105  N/m2,  free  stream 
viscosity  p  =  1.8  •  10-5  kg/m/sec,  initial  tempera¬ 
ture  T  =  273  K,  heat  coefficient  at  constant  pressure 
cp  =  1000,  polytropic  coefficient  7  =  1.4,  Prandtl-nr. 
of  Pr  =  0.71  and  angle  of  attack  of  a  =  10°.  This 
yields  a  Reynolds-nr.  of  approximately  Re  =  7  •  107 
based  on  the  length  of  the  cone,  and  a  Mach-nr.  of 
Moo  =  3.0.  The  effects  of  turbulence  were  simulated 
using  the  Baldwin-Lomax  turbulence  model.  Suther¬ 
land’s  law  was  assumed  for  the  dependence  of  viscos¬ 
ity  and  conductivity  on  temperature.  A  suitable  grid 
with  highly  stretched  elements  in  the  boundary  layer 
was  generated  using  the  advancing  layers/  advancing 
front  technique.  This  CFD  grid  had  approximately 
574  Ktet  elements.  The  actual  shell  thickness  of  the 
cone  was  assumed  be  6  =  1  mm,  with  three  reinform- 
cement  plates  of  6  =  2  mm.  The  material  proper¬ 
ties  for  the  shells  were  as  follows:  Young’s  modulus 
E  =  2  •  109  N/m2,  Poisson’s  ratio  v  =  0.30,  density 
P  —  7.84  •  103  kg/m3,  and  thermal  expansion  coef¬ 
ficient  /}  =  1.25  •  10-51/A.  The  structure  was  dis¬ 
cretized  with  the  triangular  shell  element  CTRIA2. 
The  same  discretization  was  also  used  for  the  calcula¬ 
tion  of  the  temperature  field.  The  cone  was  assumed 
to  be  clamped  at  the  base;  also,  at  this  location,  the 
temperatures  were  assumed  to  be  fixed.  The  solution 
was  initiated  by  converging  the  fluid-thermal  prob¬ 
lem,  without  any  structural  deformation.  This  took 
approximately  10  iterations.  Thereafter,  the  fluid- 
structure-thermal  problem  was  solved  in  only  2  itera¬ 
tions.  Figures  6,7  show  grids  and  the  results  obtained. 

7.3  Deforming  Panel:  As  a  second  example,  we  con¬ 
sider  supersonic  flow  over  a  deforming  panel,  as  pro¬ 
posed  by  Thornton  [Tho88].  A  diagram  of  the  initial 
conditions  is  shown  in  Figure  8.  The  panel  is  0.1016  m 
long  and  0.00254  m  thick.  The  freestream  conditions 
were  set  as  follows:  density  p  =  1.25  kg/m3,  veloc¬ 
ity  v  =  1.42  •  103  m/sec,  pressure  p  =  105  N/m2, 
free  stream  viscosity  p  =  1.8  •  10-5  kg/m/sec,  initial 
temperature  T  =  273  K,  heat  coefficient  at  constant 
pressure  cp  =  1000,  polytropic  coefficient  7  =  1.4  and 
Prandtl-nr.  of  Pr  =  0.71.  This  yields  a  freestream 
Mach-nr.  of  =  4.26.  The  effects  of  turbulence 
were  simulated  using  the  Baldwin-Lomax  turbulence 
model,  and  a  l/7th  profile  was  assumed  at  inflow. 
Sutherland’s  law  was  assumed  for  the  dependence  of 
viscosity  and  conductivity  on  temperature.  A  suit¬ 


able  grid  with  highly  stretched  elements  in  the  bound¬ 
ary  layer  was  generated  using  the  advancing  layers/ 
advancing  front  technique.  This  CFD  grid  had  ap¬ 
proximately  174  Ktet  elements.  The  material  prop¬ 
erties  for  the  panel  were  as  follows:  Young’s  modulus 
E  —  2  •  109  N/m2,  Poisson’s  ratio  v  —  0.30,  density 
P  —  7.84  •  103  kg/m3,  and  thermal  expansion  coef¬ 
ficient  /?  =  1.25  •  10_51/A\  The  structure  was  dis¬ 
cretized  with  hexahedral  volume  elements  CIHEX1. 
The  same  discretization  was  also  used  for  the  calcula¬ 
tion  of  the  temperature  field.  The  plate  was  assumed 
to  be  clamped  at  the  base  extremities  and  adibatic 
walls  were  assumed  throughout,  except  at  the  contact 
of  panel  and  fluid.  The  final  result  was  obtained  after 
5  fluid-structure-thermal  iterations.  Figure  9  shows 
the  surface  of  the  CFD  grid  and  the  results  obtained. 

8.  CONCLUSIONS  AND  OUTLOOK 

A  fluid/structure/thermal  interaction  algorithm 
based  on  the  loose  coupling  of  production  CFD,  CSD 
and  CTD  codes  has  been  described.  The  algorithm 
allows  a  cost-effective  re-use  of  existing  software,  with 
a  minimum  amount  of  alterations  required  to  account 
for  the  interaction  of  the  different  media.  Several 
example  runs  using  FEFL098  as  the  CFD  code, 
and  COSMIC-NASTRAN  or  DYNA3D  as  the 
CSD/CTD  code,  demonstrate  the  effectiveness  of  the 
proposed  methodology. 

Future  developments  will  center  on: 

-  Treatment  of  reduced  models,  or  models  with  in¬ 
compatible  dimensionalities; 

-  Improved  reliability  for  complex  geometries  un¬ 
dergoing  severe  deformations,  especially  when 
contact  is  present; 

-  Extensions  to  other  multidisciplinary  problems, 
including  electromagnetic  loads;  and 

-  Extensions  across  different  spatio/temporal 
scales,  i.e.  the  coupling  of  micro-physics  codes 
to  these  macro-physics  codes. 
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Figure  2  Coupling  Different  Discretizations 
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Figure  3  Loose  Coupling 


Figure  4  Generic  Weapon  Fragmentation:  Pressure  at  Different  Times 


Figure  5  Generic  Weapon  Fragmentation:  Surface  Velocity  at  Different  Times 
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Figure  6  Nose-Cone:  Surface  Grids  for  CFD  and  CSD/CTD 


Figure  7  CFD/CSD/CTD  Results  Obtained 
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Abstract.  Essential  methodologies  for  the  routine  simulation  of  fluid-structure  interaction  problems  where 
rupture  and  topology  change  are  present  are  developed.  These  include:  Automatic  surface  and  topology 
reconstruction,  automatic  grid  sizing  techniques,  parallel  remeshing  and  interpolation  algorithms  for  topo¬ 
logically  different  domains.  Numerical  examples  show  the  effectiveness  of  the  developed  methodology. 
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1.  INTRODUCTION 

The  advent  of  advanced  numerical  techniques,  af¬ 
fordable  3-D  graphics  and  powerful  compute  servers 
over  the  last  decade  has  led  to  a  rapid  maturing  of 
the  ‘core’  disciplines  that  encompass  Computational 
Mechanics:  Computational  Structural  Dynamics 
(CSD),  Computational  Fluid  Dynamics  (CFD)  and 
Computational  Thermodynamics  (CTD).  The  next 
logical  step  has  been  the  development  of  algorithms 
for  the  solution  of  interdisciplinary  problems.  An 
important  class  of  problems  that  fall  under  this  cat¬ 
egory  is  given  by  Fluid-Structure  interaction.  These 
problems  are  characterized  by  changes  of  (struc¬ 
tural)  geometry  due  to  fluid  pressure,  shear  and  heat 
loads  that  have  a  considerable  effect  on  the  flowfield, 
changing  the  loads  in  turn.  Examples  of  industrial 
problems  that  fall  under  this  category  are:  steady- 
state  aerodynamics  of  wings  under  cruise  condi¬ 
tions,  such  as  civilian  and  military  planes;  aeroe- 
lasticity  of  vibrating,  i.e.  elastic  structures,  such 
as  flutter  [Bat88,  Bos93,  Byu94,  Rau93,  Alo95, 

Byu98,  Kim99]  and  buzz  (aeroplanes,  nozzles,  tur¬ 
bines),  galloping  (cables,  bridges  [Kva99]),  noise 
control  [Eve90,  Eve91]  (cars,  trains,  submarines) 
and  maneuvering  and  control  (missiles,  drones); 

‘weak  and  nonlinear’  structures,  such  as  wetted 


membranes  (parachutes,  airbags  [Mes96],  parasols, 
tents,  sails  [Jac87])  and  biological  tissues  (hearts, 
lungs);  ‘strong  and  nonlinear’  structures,  such  as 
shock-structure  interaction  (command  and  control 
centers,  bunkers,  vehicles  [Bau93,  Bau96,  Kam96], 
weapon  fragmentation  [Bau98,  Bau99]),  hypersonic 
flight  vehicles  [Tho88,  Loh98b];  and  variable  geom¬ 
etry  vehicles. 


Figure  1  Fluid-Structure-Thermal  Interaction 
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The  most  important  question  is  how  to  combine 
these  different  disciplines  to  arrive  at  an  accurate, 
modular  and  cost-effective  simulation  approach  that 
can  handle  an  arbitrary  number  of  disciplines  at  the 
same  time.  Considering  the  fluid-structure-thermal 
interaction  problem  as  an  example,  we  see  from  the 
list  of  possibilities  displayed  in  Figure  1  that  any 
multi-disciplinary  capability  must  have  the  ability  to 
quickly  switch  between  approximation  levels,  mod¬ 
els,  and  ultimately  codes.  It  is  clear  that  only  those 
approaches  that  allow  maximum  flexibility,  i.e.: 

-  Linear  and  nonlinear  CFD,  CSD  and  CTD  mod¬ 
els; 

-  Different,  optimally  suited  discretizations  for 
CFD,  CSD  and  CTD  domains; 

-  Modularity  in  CFD,  CSD  and  CTD  models  and 
codes; 

-  Fast  multi-disciplinary  problem  definition; 

-  Fully  automatic  grid  generation  for  arbitrary 
geometrical  complexity;  and 

-  Integrated  multi-disciplinary  visualization 
[Ceb98]  and  data  reduction 

will  survive  in  the  long  run.  The  question  of  how  to 
couple  CSD  and  CFD  codes  has  been  treated  exten¬ 
sively  in  the  literature.  Two  main  approaches  have 
been  pursued  to  date:  strong  coupling  and  loose  cou¬ 
pling.  The  strong  or  tight  coupling  technique  solves 
the  discrete  system  of  coupled,  nonlinear  equations 
resulting  from  the  CFD,  CSD,  CTD  and  interface 
conditions  in  a  single  step.  For  an  extreme  ex¬ 
ample  of  the  tight  coupling  approach,  where  even 
the  discretization  on  the  surfaces  was  forced  to  be 
the  same,  see  [Tho88].  The  loose  coupling  technique 
solves  the  same  system  using  an  iterative  strategy 
of  repeated  ‘CFD  solution  followed  by  CTD  solu¬ 
tion  followed  by  CSD  solution’  until  convergence  is 
achieved  (see  Figure  2). 


Figure  2  Loose  Coupling 


Special  cases  of  this  second  approach  include  the 
direct  coupling  in  time  of  explicit  CFD  and  CSD 
codes  and  the  incremental  load  approach  of  steady 
aero-  and  hydro-elasticity.  The  variables  on  the 
boundaries  are  transferred  back  and  forth  between 
the  different  codes  by  a  master  code  that  directs 
the  multi-disciplinary  run.  Each  code  (CFD,  CSD, 
CTD,  ..)  is  seen  as  a  subroutine,  or  object,  that 
is  called  by  the  master  code,  or  as  a  series  of  pro¬ 
cesses  that  communicate  via  message  passing.  This 
implies  that  the  transfer  of  geometrical  and  physi¬ 
cal  information  is  performed  between  the  different 
codes  without  affecting  their  efficiency,  layout,  ba¬ 
sic  functionality,  and  coding  styles.  At  the  same 
time,  different  CSD,  CTD  or  CFD  codes  may  be 
replaced,  making  this  a  very  modular  approach. 
This  allows  for  a  straightforward  re-use  of  exist¬ 
ing  codes  and  the  choice  of  the  ‘best  model’  for  a 
given  application.  The  information  transfer  soft¬ 
ware  may  be  developed,  to  a  large  extent,  indepen¬ 
dently  from  the  CSD,  CTD  and  CFD  codes  involved, 
again  leading  to  modularity  and  software  reuse.  For 
this  reason,  this  approach  has  gained  widespread 
acceptance  [Gur90,  Loh90,  Rau93,  Gur93,  Byu94, 
Loh95a,  Bau96,  Mes96,  Coc97,  Bau98,  Byu98, 
Gri98,  Loh98b,  Bau99,  Kim99]. 

Optimal  discretizations  for  the  CSD  and  CFD  prob¬ 
lem  are,  in  all  probability,  not  going  to  be  the 
same.  As  an  example,  consider  a  commercial  air¬ 
craft  wing  undergoing  aeroelastic  loads.  For  an  ac¬ 
curate  CFD  solution  using  the  Euler  equations,  an 
accurate  surface  representation  with  60-120  points 
in  the  chord-direction  will  be  required.  For  the  CSD 
model,  a  20  x  40  mesh  of  plate  elements  may  be  more 
than  sufficient.  Any  general  fluid-structure  coupling 
strategy  must  be  able  to  handle  efficiently  the  in¬ 
formation  transfer  between  different  surface  repre¬ 
sentations.  This  is  not  only  a  matter  of  fast  in¬ 
terpolation  techniques  [Loh95b,  Mam95],  but  also 
of  accuracy,  load  conservation  [Ceb97a,b],  geomet¬ 
rical  fidelity  [Ceb97b],  and  temporal  synchroniza¬ 
tion  [Les96,  Ceb97b].  When  considering  different 
mesh  sizes  for  the  CSD  and  CFD  surface  represen¬ 
tation,  the  enforcement  of  accuracy  in  the  sense  of: 

as(x)  »  <r/(x)  (1) 

and  conservation  in  the  sense  of: 

f=/ crsndr  =  J  a  ftidT 
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(2) 


proves  to  be  non-trivial.  The  best  way  to  date 
to  handle  this  problem  for  similar  surface  rep¬ 
resentations  is  via  an  adaptive  Gaussian  quadra¬ 
ture  [Ceb97a].  In  many  instances,  the  CSD  model 
will  either  be  coarse  as  compared  to  the  CFD  model, 
or  may  even  be  on  a  different  modeling  or  abstrac¬ 
tion  level.  On  the  other  hand,  it  is  the  CSD  model 
that  dictates  the  deformation  of  the  CFD  surface. 
It  is  not  difficult  to  see  that  an  improper  transfer 
of  CSD  deformation  to  the  CFD  surface  mesh  can 
quickly  lead  to  loss  of  geometrical  fidelity.  Although 
a  number  of  recovery  techniques  have  been  proposed 
to  date  [Gur93,  Loh96,  Ceb97b],  the  proper  surface 
deformation  of  ‘non-glued7  CFD,  CSD  and  CTD  sur¬ 
faces,  and  an  error  indicator  to  warn  the  unsuspect¬ 
ing  user,  still  represent  unanswered  questions. 

Having  described  the  broader  context  of  fluid- 
structure-thermal  interaction,  we  now  turn  our  at¬ 
tention  to  a  particular  class  of  problems:  shock- 
structure  interaction,  with  rupture  and  topology 
change.  After  reviewing  the  possible  approaches  in 
Section  2,  the  elements  required  for  a  satisfactory 
solution  of  this  class  of  problems  are  dealt  with  in 
turn:  Surface  Reconstruction  and  Definition  (Sec¬ 
tion  3),  Automatic  Remeshing  (Sections  4,5)  and 
Interpolation  (Section  6).  Some  examples  are  given 
in  Section  7.  Finally,  in  Section  8,  conclusions  are 
drawn  and  future  work  areas  are  described. 

2.  SHOCK-STRUCTURE  INTERACTION 

The  interaction  of  strong  shocks  with  structures  will, 
in  most  instances,  lead  to  some  form  of  structural 
failure  (cracking,  spalation,  breach,  failure,  collapse, 
etc.).  Within  a  loose  coupling  approach  (see  Fig¬ 
ure  2),  this  implies  that  the  Netted  CSD  surface7 
will  change  in  time.  This  change  in  structural  in¬ 
tegrity  also  leads  to  new  zones  or  volumes  for  the 
fluid  that  formerly  were  not  present,  implying  that 
automatic  gridding  based  on  discrete  surface  data  is 
essential.  Within  CFD,  two  grid  systems  have  ap¬ 
peared  that  can  handle  automatically  such  time-  and 
topology- varying  geometries: 

a)  Rigid,  non-surface- conforming  grids  and 

b)  Moving,  surface-conforming  grids. 

We  briefly  describe  the  advantages  and  disadvan¬ 
tages  of  both  classes. 

a)  Rigid  Grids: 

Here,  the  grid  is  basically  laid  over  the  volume  to 
be  gridded.  Adaptivity  based  on  surface  curvature 


or  geometric  stiffness  is  used  to  refine  the  grid  close 
to  bodies  immersed  in  the  flowfield.  The  cells  cut 
by  surfaces  are  identified  and  treated  with  special 
procedures.  The  advantages  of  this  approach  are 
manifold: 

-  Construction  of  grids  is  extremely  fast; 

-  The  grid  is  not  moving,  leading  to  a  faster 
solver; 

-  Most  of  the  algorithmic  work  is  concentrated  in 
identifying  and  treating  cut  surface  cells; 

However,  the  approach  also  has  disadvantages: 

-  A  considerable  portion  of  the  grid  may  be 
£blacked  out7; 

-  New  cut  cells  and  boundary  conditions  have  to 
be  established  every  timestep; 

-  Topological  consistency  (water-tightness)  after 
cutting  cells  is  not  always  easy  to  establish; 

-  The  cut  cells  may  become  very  small  (the  treat¬ 
ment  of  small  cells  has  become  the  focus  of  much 
research  [Ber92,  Pem95,  Aft97,  Lan97]); 

-  Grids  suitable  for  RANS  calculations  are  diffi¬ 
cult  to  construct. 

b)  Moving  Grids: 

This  approach  starts  with  a  surface- conforming  grid. 
As  the  structure  deforms,  the  mesh  is  moved.  Exces¬ 
sively  distorted  or  negative  elements  are  automati¬ 
cally  removed.  The  ensuing  voids  are  remeshed  and 
the  solution  interpolated.  If  topological  changes  oc¬ 
cur,  the  surface  and  volume  mesh  have  to  be  regener¬ 
ated,  and  the  solution  reinterpolated.  This  approach 
would  appear  to  have  a  lot  of  disadvantages: 

-  Moving  grid  (ALE)  solvers  are  more  expensive 
than  fixed  grid  solvers  (recalculation  of  Jaco- 
bians,  extra  fluxes,  etc.); 

-  Geometric  conservation  laws  have  to  be  obeyed, 
placing  constraints  on  the  solver  [Les96]; 

-  Moving  the  mesh  requires  extra  computing  re¬ 
sources; 

-  Fast  remeshing  and  interpolation  algorithms  are 
required. 

However,  the  approach  also  has  advantages: 

-  Topological  consistency  (water-tightness)  is 
easy  to  establish; 

-  It  is  possible  to  construct  grids  suitable  for 
RANS  calculations. 

Over  the  last  decade,  we  have  pursued  the  second 
approach,  attempting  to  remedy  the  disadvantages 
by: 

-  Fast  ALE  edge-based  solvers  [Luo93,  Luo94, 
Bau94,  Bau95,  Loh98a]; 
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-  Mass-matrix- based  satisfaction  of  geometric 
conservation  laws  [Bau95]; 

-  Optimal  mesh  movement  algorithms  [L6h96b] ; 

-  Local  remeshing  [Loh90,  Bau98]; 

-  Parallel  unstructured  grid  generation  [Loh96a, 
Loh99]  and 

-  Optimal  interpolation  techniques  for  unstruc¬ 
tured  grids  [Loh95b]. 

In  the  sequel,  we  describe  recent  developments  that 
have  targeted  the  class  of  problems  considered  here: 
fluid-structure  interaction  with  rupture  and  topol¬ 
ogy  change. 

3.  SURFACE  RECONSTRUCTION 

Suppose  that  due  to  cracking,  failure,  spalation,  etc., 
the  'wetted  surface'  of  the  CSD  domain  has  been 
changed.  This  new  surface,  given  by  a  list  of  points 
and  faces,  has  to  be  matched  with  a  correspond¬ 
ing  CFD  surface.  The  CFD  surface  data  typically 
consists  of  surface  segments  defined  by  analytical 
functions  that  do  not  change  in  time  (such  as  ex¬ 
terior  walls,  farfield  boundaries,  etc.),  and  surface 
segments  defined  by  triangulations  (i.e.  discrete 
data)  that  change  in  time.  These  triangulations  are 
obtained  from  the  'wetted  CSD  surface'  at  every 
timestep.  When  a  change  in  topology  is  detected, 
the  new  surface  definition  is  recovered  from  the  dis¬ 
crete  data,  and  joined  to  the  surfaces  defined  ana¬ 
lytically,  as  indicated  in  Figure  3. 


Figure  3  Automatic  Surface  Reconstruction 

The  discrete  surface  is  defined  by  a  support  trian¬ 
gulation,  with  lines  and  end-points  to  delimit  its 
boundaries.  In  this  sense,  the  only  difference  with 
analytically  defined  surfaces  is  the  (discrete)  support 
triangulation.  The  patches,  lines  and  end-points  of 
the  'wetted  CSD  surface'  are  identified  by  compar¬ 
ing  the  unit  surface  normals  of  adjacent  faces.  If 


the  scalar  product  of  them  lies  below  a  certain  tol¬ 
erance,  a  ridge  is  defined.  Comers  are  defined  as 
points  that  are  attached  to: 

-  Only  one  ridge; 

-  More  than  two  ridges;  or 

-  Two  ridges  with  considerable  deviation  of  unit 
side-vector. 

Between  corners,  the  ridges  form  discrete  lines. 
These  discrete  lines  either  separate  or  are  embedded 
completely  (i.e.  used  twice)  in  discrete  surface 
patches.  Figure  4  sketches  the  recovery  of  surface 
features  and  the  definition  of  discrete  surface  patches 
for  a  simple  configuration.  For  more  information,  see 
[Loh96c]. 


Figure  4  Discrete  Surface  Recovery 

For  the  old  surface  definition  data  set,  the  surface 
patches  attached  to  wetted  CSD  surfaces  are  iden¬ 
tified  and  all  information  associated  with  them  is 
discarded.  The  remaining  data  is  then  joined  to  the 
new  wetted  CSD  surface  data,  producing  the  up¬ 
dated  surface  definition  data  set.  This  data  set  is 
then  used  to  generate  the  new  surface  and  volume 
grids. 

The  surface  reconstruction  procedure  may  be  sum¬ 
marized  as  follows: 

-  For  the  Updated  Discrete  Data,  Obtain: 

-  Surface  Patches  +  B.C. 

-  Lines 

-  End-Points 

-  Sources 

-  For  the  Old  Analytical+Discrete  Data: 

-  Remove  Discrete  Data 

-  Reorder  Arrays 

-  Merge: 

-  Old  Analytical  Data 

-  Updated  Discrete  Data 
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4.  ELEMENT  SIZE  AND  SHAPE 

Cracking,  failure,  spalation,  etc.  will  lead  to  the  ap¬ 
pearance  of  many  fragments  or  chunks  in  the  flow- 
field.  The  objects  that  are  flying  in  the  flowfield 
come  in  many  different  sizes,  and  yet  require  suffi¬ 
cient  surface  and  volume  resolution  in  order  for  the 
CFD  solver  to  yield  acceptable  forces,  moments  and 
ultimately  trajectories.  A  very  simple  way  to  ob¬ 
tain  surface  grids  that  are  acceptable  is  by  defin¬ 
ing  a  maximum  element  size  that  is  linked  to  the 
size  of  the  flying  object.  This  approach  can  lead 
to  highly  distorted  volume  grids  when  neighbouring 
objects  with  very  different  surface  discretizations  are 
present.  Compatible  surface  and  volume  grids  may 
be  obtained  by  using  either  adaptive  background 
grids  or  sources  [Loh96a,  Loh97].  In  the  present 
case,  point  sources  were  employed.  The  element  size 
as  a  function  of  the  non-dimensional  distance  from 
the  source  £(x): 


5(x)  =  <5(£(x))  ;  £  =  max  (o,  r°)  ,  (3) 

where  r(x)  is  the  shortest  distance  from  the  source 
to  x,  ro  the  so-called  zone  of  constant  element  size, 
and  rs  the  scaling  length,  is  given  by  the  polynomial: 

£(£)  =  «o  +  +  a2$2  .  (4) 

Once  the  individual  fragments  or  chunks  are  identi¬ 
fied  using  an  advancing  front  neighbour  search  over 
the  discrete  surface  patches,  the  center  of  gravity 
and  the  average  radius  can  be  computed.  This  leads 
to  the  desired  mesh  size  parameter  a 0.  The  parame¬ 
ters  ai,  a2  are  then  obtained  from  mesh  smoothness 
considerations. 

5.  AUTOMATIC  PARALLEL  REMESHING 

Over  the  last  five  years,  major  efforts  have  been 
devoted  to  harness  the  power  of  parallel  computer 
platforms.  While  many  CFD  and  CSD  solvers  have 
been  ported  to  parallel  machines,  grid  generators 
have  lagged  behind.  For  applications  where  remesh¬ 
ing  is  an  integral  part  of  simulations,  e.g.  prob¬ 
lems  with  moving  bodies  [Loh90,  Mes93,  Mes95, 
Bau96,  Kam96,  Loh98b,  Has98]  or  changing  topolo¬ 
gies  [Bau98,  Bau99],  the  time  required  for  mesh  re¬ 
generation  can  easily  consume  more  than  50%  of  the 
total  time  required  to  solve  the  problem.  Faced  with 
this  situation,  a  number  of  efforts  have  been  reported 


on  parallel  grid  generation  [Loh92,  dCo94,  Sho95, 
dCo95,  Oku96,  Che97,  Oku97,  Sai99]. 

The  two  most  common  ways  of  generating  un¬ 
structured  grids  are  the  Advancing  Front  Tech¬ 
nique  (AFT)  [Per87,  Per88,  Loh88a,b,  Per90,  Per92, 
Jin93,  Fry94,  Loh96]  and  the  Generalized  Delau¬ 
nay  Triangulation  (GDT)  [Bak89,  Geo91,  Wea92, 
Wea94,  Mar95].  The  AFT  introduces  one  element 
at  a  time,  while  the  GDT  introduces  a  new  point  at 
a  time.  Thus,  both  of  these  techniques  are,  in  prin¬ 
ciple,  scalar  by  nature,  with  a  large  variation  in  the 
number  of  operations  required  to  introduce  a  new 
element  or  point.  While  coding  and  data  structures 
may  influence  the  scalar  speed  of  the  ‘core’  AFT  or 
GDT,  one  often  finds  that  for  large-scale  applica¬ 
tions,  the  evaluation  of  the  desired  element  size  and 
shape  in  space,  given  by  background  grids,  sources 
or  other  means  [Loh96]  consumes  the  largest  fraction 
of  the  total  grid  generation  time.  Unstructured  grid 
generators  based  on  the  AFT  may  be  parallelized  by 
invoking  distance  arguments,  i.e.  the  introduction  of 
a  new  element  only  affects  (and  is  affected  by)  the 
immediate  vicinity.  This  allows  for  the  introduction 
of  elements  in  parallel,  provided  that  sufficient  dis¬ 
tance  lies  between  them. 


Boundary 

-  Currant  Front 

-  Old  Front 


Figure  5  Parallel  Grid  Generation 

A  convenient  way  of  delimiting  the  possible  zones 
where  elements  may  be  introduced  by  each  processor 
is  via  boxes.  These  boxes  may  be  obtained  in  a  va¬ 
riety  of  ways,  i.e.  via  bins,  binary  recursive  trees,  or 
octrees.  We  have  found  the  octree  to  be  the  best  of 
these  possibilities,  particularly  for  grids  with  a  large 
variation  of  element  size.  In  order  to  recover  a  paral¬ 
lel  gridding  procedure  that  resembles  closely  the  ad¬ 
vancing  front  technique  on  scalar  machines,  only  the 
boxes  covering  the  active  front  in  regions  where  the 
smallest  new  elements  are  being  introduced  are  con¬ 
sidered.  This  has  been  shown  schematically  in  Fig- 
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ure  5a, b  for  a  simle  2-D  domain.  After  these  boxes 
have  been  filled  with  elements  (Figure  5c),  the  pro¬ 
cess  starts  anew:  a  new  octree  is  built  (Figure  5d), 
new  boxes  are  created  (Figure  5e)  and  meshed  in 
parallel  (Figure  5f).  This  cycle  is  repeated  until  no 
faces  are  left  in  the  active  front  (Figures  5g-i). 

At  the  end  of  each  parallel  gridding  pass,  each  one  of 
the  boxes  gridded  can  have  an  internal  boundary  of 
faces.  For  a  large  number  of  boxes,  this  could  result 
in  a  very  large  number  of  faces  for  the  active  front. 
This  problem  can  be  avoided  by  shifting  the  boxes 
slightly,  and  then  regridding  them  again  in  parallel, 
as  shown  in  Figure  6.  This  simple  technique  has  the 
effect  of  eliminating  almost  all  of  the  faces  between 
boxes  with  a  minor  modification  of  the  basic  parallel 
gridding  algorithm. 


Figure  6  Shift  and  Regrid  Technique 

If  we  define  as  dmin  the  minimum  element  size  in 
the  active  front,  and  as  5mzn  the  minimum  box  size 
in  which  elements  are  to  be  generated,  the  parallel 
AFT  proceeds  as  follows: 

WHILE:  There  are  active  faces  left: 

-  Form  an  octree  with  minimum  octant  size  sm2-n 
for  the  active  points; 

-  Retain  the  octants  that  have  faces  that  will  gen¬ 
erate  elements  of  size  dmin  to  c\  •  dmin ; 

-  If  too  many  octants  are  left:  agglomerate  them 
into  boxes; 

-  DO  ISHFT=0 , 2: 

-  IF:  ISHFT.NE.O: 

Shift  the  boxes  by  a  preset  amount; 

-  END IF 

-  Generate,  in  parallel,  elements  in  these 
boxes,  allowing  only  elements  up  to  a  size 
of  C/  •  dmin  5 

-  ENDDO 

Increase  dmin  —  1.5  *  dmini  smin  —  1.5  *  sm{n\ 

ENDWHILE 

The  increase  factor  allowed  is  typically  in  the  range 
d  =  1.5  —  2.0.  Figures  7a-c  show  an  example  ob¬ 
tained  on  an  SGI  Origin  2000  running  in  shared 


memory  mode.  The  case  considered  is  the  garage  of 
an  office  complex,  and  had  approximately  9.2  mil¬ 
lion  tetrahedra.  As  one  can  see,  although  not  per¬ 
fect,  speedups  are  comparable  to  those  of  production 
CFD  codes.  For  more  details  on  the  parallel  grid 
generator,  see  [Loh99] 


Nr.  of  Processors 


Figure  7b  Garage:  Speedups  Obtained 


6.  INTERPOLATION  WITH  TOPOLOGY 
CHANGE 


Once  a  new  mesh  has  been  generated,  the  solution 
from  the  previous  timestep  (on  the  previous  mesh) 
has  to  be  interpolated.  A  series  of  optimal  interpo¬ 
lation  algorithms  for  unstructured  grids  have  been 
described  in  [Loh95b].  Whenever  new  fluid  domains 
are  created  due  to  failure,  cracking  and  spalation, 
interpolating  the  fluid  solution  from  the  previous 
timestep  to  these  new  domains  will  end  in  failure, 
as  there  are  no  possible  host  elements  in  the  old 
mesh.  Figure  8  shows  a  typical  case  where  a  wall 
that  initially  separates  two  rooms  breaks,  changing 
the  topology  of  the  fluid  and  solid  domains. 
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Figure  8  Interpolation  With  Different  Domains 


these  points  can  be  extrapolated  using  different  pro¬ 
cedures: 

-  Advancing  layers  (most  often  used  for  sub¬ 
sonic/isotropic  flows); 

-  Upstream  (used  primarily  for  supersonic  flows); 

-  Closest  known  point  (for  cracks);  or 

-  Via  user-prescribed  subroutine  (for  special 
cases). 

7.  EXAMPLES 

Two  examples  are  included  that  show  the  effective¬ 
ness  of  the  procedures  developed  to  date.  In  both 
instances,  the  CSD  solver  used  is  GA-DYNA  [Pel95, 
Pel97,  Pel98]  and  the  CFD  solver  used  is  FEM-FCT 
within  FEFL098  [Loh87,  ]. 


It  is  therefore  important  to  identify  points  of  the 
new  mesh  that  lie  outside  the  confines  of  the  old 
mesh.  A  simple  way  that  has  proven  successful  is  to 
form  a  Cartesian  mesh  or  bins  (Figure  9).  A  loop  is 
then  performed  over  the  elements  of  the  old  mesh, 
marking  the  bins  covered  by  elements.  In  a  second 
loop  over  the  points  of  the  new  grid,  all  points  that 
fall  into  bins  not  covered  by  the  old  grid  are  marked 
as  impossible  to  interpolate.  This  procedure  can  be 
done  recursively  by  obtaining  the  confines  of  the  vol¬ 
ume  where  points  have  been  marked  as  impossible, 
leading  to  so-called  Telescoping5  of  the  bin  search 
region. 


Figure  9  Marking  Impossible  Points 


6.1  Fragmenting  Weapon: 


The  first  case  considered  is  that  of  a  fragmenting 
weapon.  The  detonation  and  shock  propagation 
were  modeled  using  a  JWL  equation  of  state.  At  the 
beginning,  the  walls  of  the  weapon  separate  two  flow 
domains:  the  inner  one,  consisting  of  high  explosive, 
and  the  outer  one,  consisting  of  air.  As  the  structure 
of  the  weapon  begins  to  fail,  fragments  are  shrunk 
and  the  ensuing  gaps  are  automatically  remeshed, 
leading  to  one  contuinuous  domain.  The  mesh  in  the 
fluid  domain  was  adapted  using  sources  for  geomet¬ 
ric  fidelity  [Loh96a]  and  the  modified  H2-seminorm 
error  indicator  proposed  in  [Loh87,  Loh92].  At  the 
end  of  the  run,  the  flow  domain  contains  approxi¬ 
mately  750  independently  flying  bodies  and  16  mil¬ 
lion  elements.  Figures  8-10  show  the  development  of 
the  detonation.  The  fragmentation  of  the  weapon  is 
clearly  visible.  Figure  11  shows  the  correlation  with 
the  observed  experimental  evidence. 


No  attempt  is  then  made  to  interpolate  the  points 
marked  as  outside  the  old  mesh.  The  unknowns  for 
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Euler  equations  with  ideal  gas  equation  of  state. 


Figure  10b  Fragmenting  weapon  at  310msec 
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Figure  11a  Pressure  at  t=0. 0001, 0.0002 


Figure  10c  Fragmenting  weapon  at  500msec 


Figure  lOd  Radial  velocity  as  a  function 
of  fragment  weight 


6.2  Blast  Close  to  a  Wall 

This  example  shows  the  capabilities  of  the  present 
fluid-structure  interaction  methodology  to  treat  sit¬ 
uations  with  extreme  transient  loading,  plastic  de¬ 
formation,  failure,  breakup,  topology  change  and  au¬ 
tomatic  remeshing.  The  blast  was  modeled  using  the 


Figure  11c  Pressure  at  t=0. 0005, 0.0006 


Figure  lid  Pressure  at  t=0. 0008, 0.0010 
Initially,  a  ‘high  energy’  state  given  by:  p 
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0.25  gr/cm3}  v  =  0.00  cm/ sec,  p  =  0.43e  +  09  dynes 
within  the  sphere  with  radius  r  =  8.0  cm  centered 
at  x  =  34.29  cm,  y  —  —10.00  cm,  z  =  19.05  cm 


Figure  12a  Surface  Velocity  at  t=0.0002 


Figure  12b  Surface  Velocity  at  t=0.0004 


Figure  12c  Surface  Velocity  at  t=0.0006 
in  the  lower  room,  and  a  quiescent  air  state  given 


by:  p  =  0.00104  gr/cm3,v  =  0.00  cm/sec,  p  — 
0.86185e  +  06  dynes  was  prescribed.  The  wall  was 
assumed  to  be  of  reinforced  concrete.  The  mate¬ 
rial  model  used  was  elasto/perfect  plastic.  Failure 
was  assumed  to  occur  when  the  average  strain  in 
an  element  exceeded  60%.  At  the  beginning,  the 
wall  separates  the  two  rooms.  As  the  wall  begins 
to  fail  the  larger  fragments  are  shrunk  while  keeping 
their  mass  and  moments  of  inertia  intact,  leading  to 
gaps.  The  smaller  fragments  are  removed  completely 
from  the  structure  and  converted  to  spheres  that  ex¬ 
change  mass,  momentum  and  energy  with  the  fluid, 
exchange  momentum  with  the  remaining  structure 
via  contact,  but  can  not  interact  between  them,  nor 
'occupy  space’.  The  ensuing  gaps  that  result  due 
to  shrinkage  or  element  removal  are  automatically 
remeshed  and  filled  with  fluid.  At  some  point  during 
the  run,  the  flow  domain  contains  approximately  250 
independently  flying  bodies  and  1  million  elements. 
Figures  lla-d  show  a  planar  cut  through  the  two 
rooms.  The  impact  of  the  blast  wave  on  the  wall,  its 
reflection,  the  ensuing  wall  failure  and  eventual  pen¬ 
etration  of  the  blast  wave  into  the  adjacent  room 
are  clearly  visible.  Figures  12a-c  show  the  surface 
velocity  and  fragments  of  the  wall. 

8.  CONCLUSIONS  AND  OUTLOOK 

Several  methodologies  that  are  essential  for  the  rou¬ 
tine  simulation  of  fluid-structure  interaction  prob¬ 
lems  where  rupture  and  topology  change  are  present 
have  been  developed.  These  include: 

-  Surface  and  topology  reconstruction; 

-  Automatic  sizing  techniques; 

-  Parallel  remeshing;  and 

-  Interpolation  algorithms  for  topologically  differ¬ 
ent  domains. 

Numerical  examples  indicate  that  with  the  devel¬ 
oped  methodology  a  new  level  of  realism  and  so¬ 
phistication  has  been  achieved  for  this  class  of  prob¬ 
lems.  Among  the  many  outstanding  issues  we  just 
mention: 

-  Parallelization  for  distributed  memory  ma¬ 
chines; 

-  Automatic  remeshing  for  RANS  simulations; 
and 

-  Proper  treatment  of  cracking. 

At  the  threshold  of  a  new  century,  we  envision  a 
multi-disciplinary,  database-linked  framework  that 
is  accessible  from  anywhere  on  demand,  simulations 
with  unprecedented  detail  and  realism  carried  out 
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in  fast  succession,  virtual  meeting  spaces  where  ge¬ 
ographically  displaced  designers  and  engineers  dis¬ 
cuss  and  analyze  collaboratively  new  ideas,  and  first- 
principles  driven  virtual  reality. 
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Abstract 

Two  renumbering  strategies  for  field  solvers  based  on  unstructured  grids  that  operate  on  shared- memory,  cache-based  parallel  machines 
are  described.  Special  attention  is  paid  to  the  avoidance  of  cache-line  overwrite,  which  can  lead  to  drastic  performance  degradation  on  this 
type  of  machines.  Both  renumbering  techniques  avoid  cache-misses  and  cache-line  overwrite  while  allowing  pipelining,  leading  to  optimal 
coding  for  this  type  of  hardware.  ©  1998  Elsevier  Science  S.A.  All  rights  reserved. 


1.  Introduction 


There  can  hardly  be  any  doubt  that  the  1990s  are  the  decade  of  parallelism.  Even  though  machines  with 
several  powerful  vector-processors  were  installed  at  many  places  in  the  1980s,  the  operating  system  or  the 
system  administration  hardly  allowed  for  the  use  of  several  processors  during  the  same  run.  Moreover,  in  many 
instances  the  compilers  were  not  mature,  leading  to  meaningless  gains  in  performance.  With  the  advent  of 
massively  parallel  machines,  i.e.  machines  in  excess  of  500  nodes,  the  exploitation  of  parallelism  in  solvers  has 
become  a  major  focus  of  attention.  According  to  Amdahl’s  Law,  the  speed-up  s  obtained  by  parallelizing  a 
portion  a  of  all  operations  required  is  given  by 


■  +  (!-«) 


(1) 


where  Rs,Rp  denote  the  scalar  and  parallel  processing  rates  (speeds),  respectively.  Table  1  shows  the  speed-ups 
obtained  for  different  percentages  of  parallelization  and  different  numbers  of  processors. 

Note  that  even  on  a  traditional  shared-memory,  multiprocessor  vector  machine,  such  as  the  CRAY  T-90  with 
16  processors,  the  maximum  achievable  speed-up  between  scalar  code  and  parallel  vector  code  is  a  staggering 
/^s  ~  240.  What  is  important  to  note  is  that  as  we  migrate  to  higher  numbers  of  processors,  only  the 


Table  1 


Speed-ups  obtainable  (Amdahl’s  Law) 


V*. 

50% 

90% 

99% 

99.9% 

10 

1.81 

5.26 

9.17 

9.91 

100 

1.98 

9.90 

50.25 

90.99 

1000 

2.00 

9.91 

90.99 

500.25 
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embarrassingly  parallel  codes  will  survive.  Most  of  the  applications  ported  successfully  to  parallel  machines  to 
date  have  followed  the  Single  Program  Multiple  Data  (SPMD)  paradigm.  For  grid-based  solvers,  a  spatial 
subdomain  was  stored  and  updated  in  each  processor.  For  particle  solvers,  groups  of  particles  were  stored  and 
updated  in  each  processor.  For  obvious  reasons,  load  balancing  [1-4]  has  been  a  major  focus  of  activity. 

Despite  the  striking  successes  reported  to  date,  only  the  simplest  of  all  solvers:  explicit  timestepping  or 
implicit  iterative  schemes,  perhaps  with  multigrid  added  on,  have  been  ported  without  major  changes  and/or 
problems  to  massively  parallel  machines  with  distributed  memory.  Many  code  options  that  are  essential  for 
realistic  simulations  are  not  easy  to  parallelize  on  this  type  of  machine.  Among  these,  we  mention  local 
remeshing  [5],  repeated  /^-refinement,  such  as  required  for  transient  problems  [6],  contact  detection  and  force 
evaluation  [7],  some  preconditioners  [8],  applications  where  particles,  flow,  and  chemistry  interact,  and 
applications  with  rapidly  varying  load  imbalances.  Even  if  99%  of  all  operations  required  by  these  codes  can  be 
parallelized,  the  maximum  achievable  gain  will  be  restricted  to  1:100.  If  we  accept  as  a  fact  that  for  most 
large-scale  codes  we  may  not  be  able  to  parallelize  more  than  99%  of  all  operations,  the  shared  memory 
paradigm,  discarded  for  a  while  as  non-scalable,  will  make  a  comeback.  It  is  far  easier  to  parallelize  some  of  the 
more  complex  algorithms,  as  well  as  cases  with  large  load  imbalance,  on  a  shared  memory  machine.  And  it  is 
within  present  technological  reach  to  achieve  a  100  processor,  shared  memory  machine.  Such  an  alternative,  i.e. 
having  less  expensive  RISC  chips  linked  via  shared  memory,  is  currently  being  explored  by  a  number  of 
vendors.  One  example  of  such  machines  is  the  SGI  Power  Challenge,  which  at  the  time  of  writing  allows  up  to 
18  processors  to  work  in  shared  memory  mode  on  a  problem,  with  upgrades  to  92  processors  planned  within  the 
next  two  years.  In  order  to  obtain  proper  performance  from  such  a  machine,  the  codes  must  be  written  in  such  a 
way  as  to  avoid: 

(a)  Cache-misses  (in  order  to  perform  well  on  each  processor); 

(b)  Cache  overwrite  (in  order  to  perform  well  in  parallel);  and 

(c)  Memory  contention  (in  order  to  allow  pipelining). 

Thus,  although  in  principle  a  good  compromise,  shared  memory,  RISC-based  parallel  machines  actually 
require  a  fair  degree  of  knowledge  and  reprogramming  for  codes  to  run  optimally. 

The  present  paper  describes  renumbering  techniques  for  field  solvers  operating  on  unstructured  grids  that  have 
proven  useful  for  machines  of  this  kind.  A  summary  of  the  remainder  of  the  paper  follows.  Sections  2  and  3 
recall  some  previously  described  renumbering  techniques  to  minimize  cache-misses  and  achieve  pipelining. 
Section  4  treats  cache  overwrite,  a  new,  and  previously  not  accounted-for  design  requirement  for  renumbering 
strategies.  In  Section  5,  implementational  issues  are  considered.  Section  6  reports  several  scalability  studies 
obtained  on  SGI  Power  Challenge  and  SGI  Origin  systems.  Finally,  some  conclusions  are  drawn  in  Section  7. 


2.  Renumbering  to  avoid  cache  misses 

Consider  the  following  loop  over  edges  that  typifies  the  central  loop  of  many  field  solvers  based  on 
unstructured  grids  [9—14],  Similar  loops  are  obtained  for  element-  or  face-based  solvers,  and  what  follows  is 
equally  applicable  to  them.  A  right-hand  side  (RHS),  or  residual,  is  formed  at  the  edge-level  by  gathering 
information  from  a  vector  of  unknowns.  This  edge-RHS  is  then  added  to  a  global  point-RHS.  The  operations  are 
shown  schematically  in  Fig.  1,  and  a  typical  FORTRAN  implementation  would  be  the  following: 

Loop  1 

do  1600  iedge=l , nedge 
ipoil=lneod ( 1 , iedge ) 
ipoi2=lnoed ( 2 , iedge ) 

redge=geoed(  iedge) *  (unkno ( ipoi2 ) -unkno { ipoil ) ) 
rhspo ( ipoil ) =rhspo ( ipoil )  + redge 
rhspo ( ipoi2 ) =rhspo ( ipoi2 )  -  redge 
1600  continue 

If  cache-misses  are  a  concern,  then  it  is  clear  that  the  storage  locations  for  the  required  point  information  stored 
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• - - « 

ipoil  ipoi2 


Fig.  1 .  Information  flow  for  edge-based  loop. 

in  the  arrays  unkno  and  rhspo  should  be  as  close  as  possible  in  memory  when  required  by  an  edge.  At  the 
same  time,  as  the  loop  progresses  through  the  edges,  the  point  information  should  be  accessed  as  uniformly  as 
possible.  This  may  be  achieved  by  first  renumbering  the  points  using  a  bandwidth-minimization  technique  (e.g. 
Reverse  Cuthill  and  McKee  [15],  wavefront  [16],  recursive  bisection),  and  subsequently  renumbering  the  edges 
according  to  the  minimum  point  number  on  each  edge  [16],  All  of  these  algorithms  are  of  complexity  O (N)  or  at 
most  0(N  log  N),  and  are  well  worth  the  effort. 


3.  Avoidance  of  memory  contention 

Pipelining  or  vectonzation  offers  the  possibility  of  substantial  performance  gain  on  any  kind  of  system.  While 
previously  restricted  to  so-called  vector  machines,  such  as  those  manufactured  by  CRAY,  Convex,  NEC,  Fujitsu 
or  Hitachi,  the  concept  has  migrated  to  current  RISC  chips,  such  as  the  MIPS  R8000  and  R10000.  For  the  latter 
chip,  so-called  software  pipelining  is  invoked  by  the  compiler  for  certain  optimization  options.  In  order  to 
achieve  pipelining  or  vectorization,  memory  contention  issues  must  be  avoided.  The  enforcement  or  pipelining 
or  vectonzahon  is  then  carried  out  using  a  compiler  directive,  as  Loop  1,  which  becomes  an  inner  loop,  still 
offers  the  possibility  of  memory  contention.  In  this  case,  we  would  have: 

Loop  2 

do  1400  ipass=l,npass 
nedgO=edpas { ipass)  +1 
nedgl=edpas { ipass  +  1 ) 

c$dir  ivdep  , 

do  1600  iedge=nedg0,nedgl 
ipoil=lnoed ( 1 , iedge ) 
ipoi2  =lnoed ( 2 , iedge ) 
redge=geoed (  iedge) *  (unkno ( ipoi2 ) 
rhspo ( ipoil ) =rhspo ( ipoil )  +  redge 
rhspo { ipoi2 ) =rhspo ( ipoi2 )  -  redge 
1600  continue 
1400  continue 

It  is  clear  that  in  order  to  avoid  memory  contention,  for  each  of  the  groups  of  edges  (1600  loop),  none  of  the 


Pipelining  directive 


-  unkno { ipoil ) ) 
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1  NPOIN 


Fig.  2.  Point-range  covered  by  each  group  of  edges;  1 -Processor  machine;  renumbering  to  minimize  cache-misses  and  avoid  memory 
contention. 


corresponding  points  may  be  accessed  more  than  once.  Given  that  in  order  to  achieve  good  pipelining 
performance  oh  current  RISC-chips  a  relatively  short  vector  length  of  16  is  sufficient,  one  can  simply  start  from 
the  edge-renumbering  obtained  in  order  to  minimize  cache-misses,  and  renumber  it  further  into  groups  of  edges 
that  are  16  long  and  avoid  memory  contention  [16].  As  before,  this  renumbering  is  of  complexity  O (N).  The 
resulting  loop  is  shown  schematically  in  Fig.  2. 


4.  Cache  line  overwrite 

The  next  stage  is  to  port  Loop  2  to  a  parallel,  shared  memory  machine.  If  the  loop  is  left  untouched,  the 
auto-parallelizing  compiler  will  simply  split  the  inner  loop  across  processors.  It  would  then  appear  that 
increasing  the  vector-length  to  a  sufficiently  large  value  would  offer  a  satisfactory  solution.  However,  this  is  not 
advisable  for  the  following  reasons: 

(a)  Every  time  a  parallel  do-loop  is  launched,  a  start-up  time  penalty,  equivalent  to  several  hundred  Flops  is 
incurred.  This  implies  that  if  scalability  to  even  16  processors  is  to  be  achieved,  the  vector  loop  lengths 
would  have  to  be  16*1000.  For  typical  tetrahedral  grids  we  encounter  approximately  22  maximum 
vector-length  groups,  indicating  that  we  would  need  at  least  22*16*1000  =  352  000  edges  to  run 
efficiently. 

(b)  Because  the  range  of  points  in  each  group  increases  at  least  linearly  with  vector  length,  so  do  cache 
misses.  This  implies  that  even  though  one  may  gain  parallelism,  the  individual  processor  performance 
would  degrade.  The  end  result  is  a  very  limited,  non-scalable  gain  in  performance. 

(c)  Because  the  points  in  a  split  group  access  a  large  portion  of  the  edge-array,  different  processors  may  be 
accessing  the  same  cache-line.  When  a  ‘dirty  cache-line’  overwrite  occurs,  all  processors  must  update  this 
line,  leading  to  a  large  increase  of  interprocessor  communication,  severe  performance  degradation  and 
non-scalability.  Experiments  on  an  8-processor  SGI  Power  Challenge  showed  a  maximum  speed-up  of 
only  1:2.5  when  using  this  option.  This  limited  speed-up  was  attributed,  to  a  large  extent,  to  cache-line 
overwrites. 

In  view  of  these  consequences,  additional  renumbering  strategies  have  to  be  implemented.  In  the  following, 
we  discuss  two  edge-group  agglomeration  techniques  that  minimize  cache  misses,  allow  for  pipelining  on  each 
processor,  and  avoid  cache  overwrite  across  processors.  Both  techniques  operate  on  the  premise  that  the  points 
accessed  within  each  parallel  inner  edge-loop  (1600  loop)  do  not  overlap. 

Before  going  on,  we  define  edmin  ( 1  :npass)  ,  edmax  ( 1  rnpass)  to  be  the  minimum  and  maximum  point 
accessed  within  each  group,  nproc  the  number  of  processors,  and  the  point-range  of  each  group  ipass  as 
[edmin ( ipass) f  edmax ( ipass) ] . 
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4.1.  Local  agglomeration 

The  first  way  of  achieving  pipelining  and  parallelization  is  by  processing,  in  parallel,  nproc  independent 
vector-groups  whose  individual  point-range  does  not  overlap.  The  idea  is  to  renumber  the  edges  in  such  a  way 
that  nproc  groups  are  joined  together  where,  for  each  one  of  these  groups,  the  point  ranges  do  not  overlap  (see 
Fig.  3).  As  each  one  of  the  sub-groups  has  the  same  number  of  edges,  the  load  is  balanced  across  the  processors. 
The  actual  loop  is  given  by 

Loop  3 

do  1400  impass=l , npass 
nedgO=edpas ( ipass)  +1 
nedgl^edpas  ( ipass  4- 1 ) 

c  !  Parallelization  directive 

c$doacross  local ( iedge, ipoil , ipoi2 , redge) 

c$dir  ivdep  !  Pipelining  directive 

do  1600  iedge=nedgO , nedgl 
ipoil=lnoed ( 1 , iedge ) 
ipoi2=lnoed ( 2 , iedge ) 

redge=geoed  (  iedge)  *  (unkno  ( ipoi2  )  -  unkno  ( ipoil ) ) 
rhspo ( ipoil ) =rhspo ( ipoil )  + redge 
rhspo ( ipoi2 ) =rhspo ( ipoi2 )  -  redge 
1600  continue 
1400  continue 


Note  that  the  number  of  edges  in  each  pass,  i.e.  the  difference  nedgl  -  nedgO  + 1  is  now  nproc  times  as  large 
as  in  the  original  Loop  2.  As  one  can  see,  this  type  of  renumbering  entails  no  code  modifications,  making  it  very 
attractive  for  large  production  codes.  However,  a  start-up  cost  is  incurred  whenever  a  loop  across  processors  is 
launched.  This  would  indicate  that  long  vector-lengths  should  be  favoured.  However,  cache-misses  increase  with 
vector-length,  so  that  this  strategy  only  yields  a  limited  speed-up.  This  leads  us  to  the  second  renumbering 
strategy. 


i 


NPOIN 


Fully  Parallel 
Region 


♦ 

11  Processor 
Region 


Fig.  3.  Point-range  covered  by  each  group  of  edges;  3-Processor  machine;  renumbering  to  minimize  cache-misses  and  avoid  memory 
contention. 
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4.2.  Global  agglomeration 

A  second  way  of  achieving  pipelining  and  parallelization  is  by  processing  all  the  individual  vector-groups  in 
parallel  at  a  higher  level  (see  Fig.  4).  In  this  way,  short  vector  lengths  can  be  kept  and  low  start-up  costs  are 
achieved.  As  before,  the  point-range  between  macro-groups  must  not  overlap.  This  renumbering  of  edges  is 
similar  to  domain-splitting  [1,4,17-19],  but  does  not  require  an  explicit  message  passing  or  actual  identification 
of  domains.  Rather,  all  the  operations  are  kept  at  the  (algebraic)  array  level.  The  number  of  sub-groups,  as  well 
as  the  total  number  of  edges  to  be  processed  in  each  macro-group,  is  not  the  same.  However,  the  imbalance  is 
small,  and  does  not  affect  performance  significantly.  The  actual  loop  is  given  by 


Loop  4 


do  1000  imacg=l , npasg, nproc 
imac0=  imacg 

imacl=min (npasg, imacO  +  nproc  -  1 ) 

c  -  Parallelization  directive 

c$doacross  local (ipasg, ipass,npas0, 

npasl , iedge , nedgO , nedgl , 
c$&  ipoil , ipoi2 , redge) 

do  1200  ipasg=imac0 , imacl 
npas0=edpag( ipasg)  +1 
npasl=edpag ( ipasg  +  1 ) 
do  1400  ipass=npas0 , npasl 
nedgO  =edpas  ( ipass)  +1 
nedgl =edpas ( ipass  + 1 ) 

c$dir  ivdep  ,  Pipelining  directive 

do  1600  iedge=nedg0, nedgl 
ipoil=lnoed ( 1 , iedge ) 
ipoi2=lnoed ( 2 , iedge ) 

redge=geoed(  iedge)  *  (unkno  ( ipoi2 )  -  unkno  ( ipoil ) ) 
rhspo ( ipoil ) =rhspo ( ipoil )  +  redge 
rhspo ( ipoi2 ) =rhspo ( ipoi2 )  -  redge 
1600  continue 

1400  continue 

1200  continue 
1000  continue 


Fully  Para 
Region 


- _ H-  A  1-Processc 

NEDGE  Lj  r* -  »|  |  Region 

Fig  4.  Point-range  covered  by  each  group  of  edges;  3-processor  machine;  renumbering  to  minimize  cache-misses.  Avoid  memory 
contention  and  minimize  start-up  costs.  J 
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As  one  can  see^  this  type  of  renumbering  entails  two  outer  loops,  implying  that  a  certain  amount  of  code  rewrite 
is  required.  On  the  other  hand,  the  original  code  can  easily  be  retrieved  by  setting  edpag(i)=o 
edpag  ( 2  )  =npas ,  npasg=l  for  conventional  uniprocessor  machines. 

If  the  length  of  the  cache-line  is  known,  one  may  relax  the  restriction  of  non-overlapping  point  ranges  to 

alTcL^  msted8toadatehne  ra"8eS'  ^  aH°WS  ^  flexibUity’  and  leads  to  almost  Perfect  load  balance  for 
A  simple  edge-renumbering  scheme  that  we  have  found  effective  is  the  following  multipass  algorithm: 

5.1.  Pass  1:  Agglomerate  in  groups  of  edges  with  point-range  npoin/nproc,  setting  the  maximum  number  of 
groups  in  each  macro-group  naggl  to  the  number  of  groups  obtained  for  the  range  1 :  npoin  /  nproc ; 

5.2.  Passes  2jf: 

-  For  the  remaining  groups:  determine  the  point-range; 

-  Estimate  the  range  of  the  first  next  macro-group  from  the  point  range  and  the  number  of  processors- 

-  Agglomerate  the  groups  in  this  range  to  obtain  the  first  next  macro-group,  and  determine  the  number  of 
groups  tor  each  macro-group  in  this  pass  naggl; 

...  -  KL,a'th  thougil  ,he  remaining  groups  of  edges,  attempfing  to  obtain  macro-groups  of  length  naggl 
Although  not  optimal,  this  simple  slrategy  yields  balanced  macro-gtoups.  as  can  be  seen  from  the  examples 

tom  (IJe  g  "  '°a<l  balanCi"S  a'80ri‘hmS  P0SSiMe'  “  eV'de“Cd  by  “  ^  of 


5.  Implementational  issues 


For  large-scale  codes,  having  to  re-write  and  test  several  hundred  subroutines  can  be  an  onerous  burden.  To 
m  e  matters  worse,  present  compilers  force  the  user  to  declare  explicitly  the  local  and  shared  variables.  This  is 
easily  done  for  simple  loops  such  as  the  one  described  above,  but  can  become  involved  for  the  complex  loops 
with  many  scalar  temporaries  that  characterize  advanced  CFD  schemes  written  for  optimal  cache  reuse.  We  have 
found  that  m  some  cases,  compilers  may  refuse  to  parallelize  code  that  has  all  variables  declared  properly  A 
technique  that  has  always  worked,  and  that  reduces  the  amount  of  variables  to  be  declared,  is  to  write 
sub-subroutines.  For  Loop  4,  this  translates  into: 


Master  Loop  4 

do  1000  imacg=l ,  npasg,  proc 
imac0=  imacg 

imacl=min (npasg, imacO +  nproc  -  1 ) 

c  !  Parallelization  directive 

c$doacross  local (ipasg) 

do  1200  ipasg=imac0 ,  imacl 
call  loop2p ( ipasg) 

1200  continue 
1000  continue 

Loop  2p  becomes  a  subroutine  of  the  form: 

subroutine  loop2p ( ipasg) 
npas0=edpag{  ipasg)  +1 
npasi=edpag ( ipasg  +  1 ) 
do  1400  ipass=npas0 , npasl 
nedg0=edpas  ( ipass)  +1 
nedgl=edpas ( ipass  +  1 ) 
c$dir  ivdep 

do  1600  iedge=nedg0 , nedgl 


Pipelining  directive 
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ipoil=lnoed ( 1 , iedge ) 
ipoi2=lnoed ( 2 , iedge) 

redge=geoed (  i.edge) * (unkno ( ipoi2 )  unkno ( ipoil ) ) 
rhspo ( ipoil ) =rhspo ( ipoil )  + redge 
rhspo ( ipoi2 ) =rhspo ( ipoi2 )  -  redge 
1600  continue 
1400  continue 


6.  Example  timings 

The  renumbering  strategies  described  were  coded  into  FEFL097,  an  adaptive,  edge-based  finite  element  code 
for  the  solution  of  compressible  and  incompressible  flows  [20].  The  compressible  solver  incorporates,  among 
other  options  van  Leer’s  flux-vector  and  Roe’s  flux-difference  splitting  techniques.  The  incompressible  solver  is 
based  on  a  projection  technique,  implying  that  the  bulk  of  the  CPU  time  is  spent  in  a  Laplacian  loop  of  precisely 
the  type  discussed  above.  The  first  two  cases  were  run  on  an  SGI  Power  Challenge  with  6  R8000  processors,  4 
Mbytes  of  cache  and  512  Mbytes  of  memory,  whereas  the  third  case  was  run  on  an  SGI  Origin  2000  system 
with  8  R 10000  processors,  4  Mbytes  of  cache  and  4  Gbytes  of  memory. 

6.1.  F-117 

The  surface  mesh,  as  well  as  the  (unconverged)  solution  after  50  timesteps  are  shown  in  Fig.  5(a,b).  The  mesh 
had  approximately  280  Ktetra,  52  Kpts,  8.6  Kboundary  points  and  340  Kedges.  After  renumbering,  the 


Mesh 


Fig.  5.  F-117  (transonic),  (a)  Surface  mesh;  (b)  pressure  (MaTO  =  0.65). 
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Table  2 


ipasg 

npasO 

npasl 

(a)  F-117  problem  on  2  processors 

1  1 

8705 

l 

8706 

17410 

3 

17411 

18936 

4 

18937 

20462 

5 

20463 

20760 

6 

20761 

21058 

7 

21059 

21263 

(b)  F-117  problem  on  4  processors 

1  1 

3791 

2 

3792 

7582 

3 

7583 

11373 

4 

11374 

15164 

5 

15165 

16054 

6 

16055 

16944 

7 

16945 

17834 

8 

17835 

18724 

9 

18725 

19231 

10 

19232 

19738 

11 

19739 

20245 

12 

20246 

20590 

13 

20591 

20724 

14 

20725 

20858 

17 

20859 

21263 

(c)  F-117  problem  on  6  processors 

1  1 

2160 

l 

2161 

4320 

3 

4321 

6480 

4 

6481 

8640 

5 

8641 

10800 

6 

10801 

12960 

7 

12961 

13555 

8 

13556 

14150 

9 

14151 

14745 

10 

14746 

15340 

1 1 

15341 

15935 

12 

15936 

16530 

13 

16531 

17161 

14 

17162 

17792 

15 

17793 

18423 

16 

18424 

19054 

17 

19055 

19685 

18 

19686 

20121 

19 

20122 

20273 

20 

20274 

20425 

21 

20426 

20577 

22 

20578 

20729 

25 

20730 

20857 

26 

20858 

20985 

27 

20986 

21052 

31 

21053 

21180 

37 

21181 

21263 

(d)  F- 117:  Actual  vs.  optimal  edge-allocation 


nproc  actual  %  optimal  % 

2  50.482  50.00 

4  26.934  25.00 

6  18.234  16.67 

(e)  Timings  for  F-117  problem 

nproc _ time  (s)  CPU  (s) 


1  919  897.0 

2  485  942.5 

1  281  1087.5 

5 _  220  1235.1 


loopl _ ipmin  ipmax 


139280 

139280 

24416 

24416 

4768 

4767 

3280 


60656 

60656 

60656 

60656 

14240 

14240 

14240 

14240 

8112 

8112 

8112 

5519 

2144 

2144 

6480 


34560 

34560 

34560 

34560 

34560 

34560 

9520 

9520 

9520 

9520 

9520 

9520 

10096 

10096 

10096 

10096 

10096 

6975 

2432 

2432 

2432 

2432 

2048 

2048 

1072 

2048 

1328 


23588 

19980 

45465 

49486 

50875 

50319 


1 

10709 

23029 

35631 

8665 

19470 

32241 

45458 

19430 

34507 

48784 

50729 

22706 

50168 

47928 


1 

6413 

14021 

22668 

31487 

39751 

4858 

12696 

20533 

28369 

36851 

45325 

6247 

19030 

27860 

38363 

48192 

50497 

12674 

30955 

47184 

49873 

22256 

31320 

50319 

22573 

47985 


23588 

46686 

27264 

50160 

50832 

51874 

51320 


10704 

23020 

35623 

46685 

13101 

25301 

37526 

48692 

26447 

48782 

50727 

51874 

48803 

51041 

51226 


6282 

13918 

22627 

31484 

39596 

46585 

7897 

17070 

25686 

33391 

40814 

47920 

15424 

24188 

34282 

48191 

50496 

51874 

25973 

34624 

48525 

50832 

26268 

48657 

51007 

48853 

49065 


loss  % 

To 

7.7 

9.4 


Speedup 

Too 

1.89 

3.27 

4.17 


/  \ 
i 
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maximum  and  average  bandwidths  were  3797  and  2510.  The  vector-loop  length  was  set  to  16,  which  was  found 
to  be  sufficient  for  good  performance  on  the  SGI  Power  Challenge.  The  grouping  of  edges  according  to  the 
number  of  processors  is  given  in  Table  2(a-c).  The  corresponding  percentage  of  edges  processed  by  the 
processor  with  the  maximum  number  of  edges,  as  well  as  the  theoretical  loss  of  performance  due  to  imbalance 
(ratio  of  actual  work  carried  out  by  this  processor  vs.  the  minimum  possible  work)  is  shown  in  Table  2(d).  Table 
2(e)  summarizes  the  clock  time  and  total  CPU  time,  as  well  as  the  speed-ups  obtained  for  a  run  of  50  timesteps, 
including  i/  o,  renumbering,  etc.  As  one  can  see,  the  performance  degrades  with  the  number  of  processors.  This 
is  to  be  expected,  as  the  increasing  number  of  passes  results  in  higher  relative  loop  costs,  and  portions  of  the 
code  (i/o,  renumbering,  indirect  data  structures,  some  residual  sums,  etc.)  are  still  running  in  uni-processor 
mode.  Given  that  the  machine  used  only  had  6  processors,  timings  obtained  for  the  6  processor  case  may  be 
higher  than  expected. 


6.2.  Sphere 


The  surface  mesh,  as  well  as  the  solution  after  50  timesteps  are  shown  in  Fig.  6(a,b).  The  mesh  had 
approximately  332  Ktetra,  61  Kpts,  8.8  Kboundary  points  and  402  Kedges.  After  renumbering,  the  maximum 
and  average  bandwidths  were  1993  and  1411.  The  vector-loop  length  was  again  set  to  16.  The  grouping  of  edges 
according  to  the  number  of  processors  is  given  in  Table  3(a— c).  The  corresponding  percentage  of  edges 
processed  by  the  processor  with  the  highest  number  of  edges,  as  well  as  the  theoretical  loss  of  performance  due 
to  imbalance  is  shown  in  Table  3(d).  Table  3(e)  summarizes  the  speed-ups  obtained  for  a  run  of  50  timesteps, 
including  i/o,  renumbering,  etc.  Note  that  the  speed-up  is  superlinear  up  to  four  processors.  A  convincing 
explanation  of  this  phenomenon  is  still  elusive,  but  we  speculate  on  reduced  cache-misses  for  the  multiprocessor 
runs.  As  before,  the  machine  used  had  a  total  of  6  processors,  so  that  the  timings  for  the  6  processor  case  have  to 
be  qualified.  The  bulk  of  the  work  for  this  incompressible  flow  case  is  performed  in  a  Laplacian-like  inner  loop 
over  edges,  showing  that  the  data  structures  discussed  perform  well. 


Mesh 


Pressure 


Fig.  6.  Sphere  (incompressible),  (a)  Surface  mesh;  (b)  pressure. 
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Table  3 

ipasg  npasO  npasl 

(a)  Sphere  problem  on  2  processors 


1  10753 


2 

10754 

21506 

3 

21507 

22282 

4 

22283 

23058 

5 

23059 

23937 

6 

23938 

24816 

7 

24817 

35118 

(b)  Sphere  problem  on  4 

processors 

1 

1 

5023 

2 

5024 

10046 

3 

10047 

15069 

4 

15070 

20092 

5 

20093 

21097 

6 

21098 

22102 

7 

22103 

23107 

8 

23108 

23940 

9 

23941 

24175 

10 

24176 

24410 

11 

24411 

24645 

13 

24646 

24773 

14 

24774 

24836 

17 

24837 

25118 

(c)  Sphere  problem  on  6  processors 

1 

1 

3349 

2 

3350 

6698 

3 

6699 

10047 

4 

10048 

13396 

5 

13397 

16745 

6 

16746 

18797 

7 

18798 

19639 

8 

19640 

20481 

9 

20482 

21323 

10 

21324 

22165 

11 

22166 

23007 

12 

23008 

23094 

13 

23095 

23363 

14 

23364 

23632 

15 

23633 

23901 

16 

23902 

24001 

19 

24002 

24149 

20 

24150 

24297 

21 

24298 

24354 

25 

24355 

24482 

26 

24483 

24610 

31 

24611 

24738 

32 

24739 

24765 

37 

24766 

25118 

(d)  Sphere:  Actual  vs.  optimal  edge-allocation 

nproc 

actual  % 

optimal ( 

2 

50.601 

50.00 

4 

26.567 

25.00 

6 

20.770 

16.67 

(e)  Timings  for  sphere  problem 


nProc  time  (s)  CPU  (s) 


2259 

2204.0 

1050 

2043.2 

532 

2065.0 

430 

2493.3 

loopl 


172048 

172048 

12416 

12416 

14064 

14062 

4832 


80368 

80368 

80368 

80368 

16080 

16080 

16080 

13326 

3760 

3760 

3760 

2048 

1008 

4512 


53584 

53584 

53584 

53584 

53584 

32832 

13472 

13472 

13472 

13472 

13472 

1390 

4304 

4304 

4304 

1600 

2368 

2368 

912 

2048 

2048 

2048 

432 

5648 


loss  % 


1.2 

6.2 

24.6 


Speedup 

1.00 

2.15 

4.25 

5.25 


ipmin  ipmax 


1 

27373 

27377 

54935 

25525 

29268 

53374 

56594 

55413 

58529 

58530 

61039 

57736 

59207 

1 

13171 

13172 

26969 

26970 

40962 

40964 

54771 

11892 

28113 

39029 

55218 

55219 

58651 

58654 

61039 

26232 

28688 

53705 

55676 

57869 

59179 

26788 

55787 

58484 

59308 

54486 

56437 

1 

8960 

8962 

18476 

18478 

28352 

28354 

38356 

38357 

48329 

48330 

54935 

7924 

19449 

26448 

38468 

46448 

55065 

55066 

58148 

58149 

60703 

60705 

61039 

17870 

38498 

53557 

55636 

57295 

58737 

60427 

60914 

36539 

38829 

54246 

55940 

57991 

58859 

36906 

39125 

54651 

56231 

37201 

39439 

54986 

56291 

37506 

40314 

106 


R.  Lohner  /  Comput.  Methods  Appl.  Mech.  Engrg.  163  (1998)  95-109 


Table  4 

ipasg  min  (loopl)  max  (loopl) 


(a)  Inlet  problem  on  2  processors 


1-2 

237776 

237776 

3-4 

71344 

71344 

5-6 

21392 

21392 

7-8 

6416 

6416 

9-10 

1770 

1936 

11-12 

112 

1024 

13-14 

0 

576 

(b)  Inlet  problem  on  4  processors 

1-4 

118880 

118880 

5-8 

35664 

35664 

9-12 

10704 

10704 

13-16 

3216 

3216 

17-20 

266 

1024 

21-24 

800 

1024 

25-28 

0 

256 

(c)  Inlet  problem  on  6  processors 

1-6 

79248 

79248 

7-12 

23776 

23776 

13-18 

7136 

7136 

19-24 

0 

2144 

25-30 

1024 

1024 

31-36 

0 

1024 

37-42 

0 

336 

(d)  Inlet  problem  on  8  processors 

1-8 

59440 

59440 

9-16 

17824 

17824 

17-24 

5360 

5360 

25-32 

650 

1600 

33-40 

0 

1024 

41-48 

0 

1024 

49-56 

0 

288 

(e)  Sphere:  Actual  vs.  optimal  edge- allocation 

nproc 

actual  % 

optimal  % 

loss  % 

2 

50.122 

50.00 

0.24 

4 

25.140 

25.00 

0.56 

6 

16.884 

16.67 

1.01 

8 

12.743 

12.50 

1.94 

(4f):  Speedups  for  inlet  problem 

nproc 

Speedup  (Shared) 

Speedup  (PVM) 

2 

1.81 

1.83 

4 

3.18 

3.50 

6 

4.31 

5.10 

8 

5.28 

_ 

6.3.  Supersonic  inlet 


The  surface  definition,  as  well  as  the  solution  after  800  timesteps  are  shown  in  Fig.  7(a,b).  The  mesh  had 
approximately  540  Ktetra,  106  Kpts,  30  Kboundary  points  and  680  Kedges.  After  renumbering,  the  maximum 
and  average  bandwidths  were  1057  and  468.  The  vector-loop  length  was  again  set  to  16.  Instead  of  showing 
detailed  tables  as  before,  only  the  minimum  and  maximum  number  of  edges  processed  for  each  pass  over  the 
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Surface  Definition 


Density 


FEFLQ97 


Shared  Memory 
Distributed  Memory  (PVM) 


NELEM=»542895 
NPOIN=106285 
NBOUN*  30094 
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/ 
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NITERs  3 
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ltr  TMT  -  O 

L. 

'7 
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MACH  *  3.0 

7 

» 

Processors 

Fig.  7.  Inlet  problem  (supersonic),  (a)  Surface  definition;  (b)  density;  (c)  speedups. 
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processors  is  given  in  Table  4(a-d).  The  corresponding  percentage  of  edges  processed  by  the  first  processor,  as 
well  as  the  theoretical  loss  of  performance  due  to  imbalance  is  shown  in  Table  4(e).  As  compared  to  the 
previous  two  examples,  a  near  optimal  load  balance  is  achieved.  Two  reasons  may  be  given  for  this 
improvement.  First,  this  example  was  run  with  a  newer  version  of  the  renumbering  techniques.  Second,  the 
constraint  of  monotonicity  in  the  range  of  points  covered  by  each  macro-group  of  edges  was  relaxed  to  a 
non-overlap  at  the  cache-line  level,  which  for  the  SGI  Power  Challenge  is  about  15  reals.  Table  4(f)  and  Fig. 
7(c)  summarize  the  speed-ups  obtained  for  a  run  of  100  timesteps,  including  i/o,  renumbering,  etc.  for  an  SGI 
Origin  2000  machine.  Although  this  machine  is  not  a  true  shared-memory  machine,  it  exhibits  very  fast 
inter-processor  transfer  rates,  making  it  possible  to  achieve  reasonable  speedups  in  shared  memory  mode.  The 
same  run  was  repeated  using  domain  decomposition  and  message  passing  under  PVM.  Observe  that  although  the 
PVM-run  achieves  better  speedup,  the  shared  memory  run  is  still  competitive. 


7.  Conclusions 

Two  renumbering  strategies  for  field  solvers  based  on  unstructured  grids  that  operate  on  shared-memory, 
cache-based  parallel  machines  have  been  described.  Special  attention  was  given  to  the  avoidance  of  cache-line 
overwrite,  a  hitherto  not  considered  design  requirement,  which,  if  not  taken  into  account,  can  lead  to  drastic 
performance  degradation  on  this  type  of  machine.  Both  renumbering  techniques  avoid  cache-misses  and 
cache-line  overwrite  while  allowing  pipelining,  leading  to  optimal  coding.  While  the  first  technique  requires  no 
code  rewrite  of  the  field  solver,  its  scalability  is  expected  to  degrade  for  a  large  number  of  processors.  The 
second  technique  requires  a  moderate  rewrite  of  traditional  field  solvers,  but  offers  the  potential  of  near-linear 
scalability  for  a  large  number  of  processors  and  problem  sizes.  Numerical  experiments  indicate  that  with  these 
renumbering  techniques,  the  number  of  passes  over  the  processors  is  always  below  the  theoretical  minimum 
number  of  passes  one  would  require  for  maximum-length  loops,  which  for  tetrahedral  meshes  is  7.  This  implies 
that  these  techniques  are  also  applicable  to  static  memory  machines  like  the  CRAY-T90,  reducing  loop  start-up 
costs  and  improving  performance  as  compared  to  straightforward  inner  loop  autotasking.  As  with  any  other 
technique,  improvements  and  variations  are  possible.  The  techniques  described  will,  however,  work  on  any 
shared-memory,  cache-based  machine,  and  are  in  this  sense  general. 
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